Unimpressive performance of large MD raid

Discussion in 'Storage' started by kkkk, Apr 22, 2009.

  1. NTFS unsafe in case of power loss?

    User data is not protected by journaling.
    Depends on the scenario. With >2000 files per directory, things do change - FAT uses linear directories, while NTFS uses B-trees similar to database indices.
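    The directory-structure difference above can be sketched numerically. This is a rough stand-in, not NTFS code: a sorted list with binary search plays the role of the B-tree, and the file names are made up for the example.

```python
import bisect

# Hypothetical sketch: FAT-style directories are scanned linearly (O(n)),
# while NTFS keeps directory entries in B-trees, giving O(log n) lookups.
# A sorted list + binary search stands in for a real B-tree here.

names = sorted(f"file{i:05d}.txt" for i in range(2000))

def linear_lookup(names, target):
    """FAT-style: compare against every entry until a match is found."""
    comparisons = 0
    for name in names:
        comparisons += 1
        if name == target:
            return comparisons
    return comparisons

def btree_like_lookup(names, target):
    """NTFS-style stand-in: binary search over sorted entries."""
    i = bisect.bisect_left(names, target)
    found = i < len(names) and names[i] == target
    # each bisect step halves the range, so roughly log2(n) comparisons
    return found, len(names).bit_length()

print(linear_lookup(names, "file01999.txt"))      # 2000 comparisons
print(btree_like_lookup(names, "file01999.txt"))  # found in ~11 steps
```

    At 2000 files per directory the linear scan is already ~200x more work per lookup, which is why the cutoff mentioned above matters.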
    Maxim S. Shatskih, Apr 24, 2009

  2. David Brown Guest

    Google for "linux raid 5" - there are a few million hits, most of which
    are for software raid (i.e., MD raid). Googling for "linux raid 6" only
    gets you a few hundred thousand hits.
    Here is a link that might be useful, if you want to know the details of
    Linux raid 6:

    David Brown, Apr 24, 2009

  3. calypso Guest

    You are too detail-oriented... Engineer, right? :)
    Very similar compared to RAID1 and RAID0... Read something between the
    lines; the concept is what matters, not the details... RAID3 uses a parity
    disk just as RAID5 does, but RAID5 uses a 'virtual' distributed parity
    disk... It's a totally different concept from RAID1...
    8+2 drives are 10... 16+2 drives are 18... 8 drives are optimal...

    BTW, in EMC CLARiiON storage arrays RAID3 can only be installed as 5 or 9
    drives. Fine, but if everything is aligned to base-2, then why work around
    it? A RAID5 of 9 drives is better than one with 8 drives (7+1), and both
    can be used... But one is better aligned than the other... Or do you have
    something contrary to say now again?
    Norman Ken Ouchi at IBM was awarded a 1978 U.S. patent 4,092,732[19] titled
    "System for recovering data stored in failed memory unit." The claims for
    this patent describe what would later be termed RAID 5 with full stripe
    writes. This 1978 patent also mentions that disk mirroring or duplexing
    (what would later be termed RAID 1) and protection with dedicated parity
    (that would later be termed RAID 4) were prior art at that time.

    The term RAID was first defined by David A. Patterson, Garth A. Gibson and
    Randy Katz at the University of California, Berkeley in 1987. They studied
    the possibility of using two or more drives to appear as a single device to
    the host system and published a paper: "A Case for Redundant Arrays of
    Inexpensive Disks (RAID)" in June 1988 at the SIGMOD conference.[20]

    Yup, you're right... In that way, since you have to be right every time, I
    will say that Patterson's picture describes the RAID5 small-write penalty...

    Satisfied now?

    "Optuzens li zvijezdau pasiru ?" upita kamiona nabija krastavaco izbacuje.
    "Ne znam ja nista !" rece kakaoa pjeva "Ja samo prozorciceg gladija bradatm !" By runf

    Damir Lukic, [email protected]_MAKNIOVO_fly.srk.fer.hr
    calypso, Apr 24, 2009
  4. kkkk Guest

    This guy


    is doing basically the same thing as I am doing with software raid, done
    with ZFS on FreeBSD (raid-Z2 is basically raid-6), writing and reading 10GB
    files. His results are a heck of a lot better than mine with default
    settings, and not very distant from the bare hard disks' throughput (he
    seems to get about 50MB/sec per non-parity disk).

    This shows that software raid is indeed capable of doing good stuff in
    theory. It's just that Linux MD + ext3 seems to have some performance problems :-(
    kkkk, Apr 24, 2009
  5. I did not check the kernel code, but logically, writing to /dev/null
    you do not need to copy data, so normally I would expect 2 times
    more copying. I would try the bs parameter to dd; for example,
    on my machine

    dd if=/dev/zero of=/dev/null count=1000000

    needs 0.560571s while

    time dd if=/dev/zero of=/dev/null count=100000 bs=10240

    (which copies twice as much data) needs 0.109896s.

    By default dd uses a 512-byte block, which means that you do a lot
    of system calls (each block is copied using a separate call to
    read and write).
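    The per-call overhead can be made concrete by counting the read/write call pairs for a fixed amount of data. This is a sketch using in-memory streams as stand-ins for /dev/zero and /dev/null, not a measurement of real dd:

```python
import io

# Why dd's default 512-byte block size hurts: copying the same amount of
# data in small chunks means many more read/write call pairs.
# (In-memory streams stand in for if=/dev/zero and of=/dev/null.)

def copy_count_calls(total_bytes, block_size):
    src = io.BytesIO(b"\0" * total_bytes)   # stands in for /dev/zero
    dst = io.BytesIO()                      # stands in for /dev/null
    calls = 0
    while True:
        buf = src.read(block_size)          # one "read" call
        calls += 1
        if not buf:
            break
        dst.write(buf)                      # one "write" call
        calls += 1
    return calls

print(copy_count_calls(10 * 1024 * 1024, 512))          # 40961 calls
print(copy_count_calls(10 * 1024 * 1024, 1024 * 1024))  # 21 calls
```

    Same 10 MiB, roughly 2000x fewer calls with a 1 MiB block, which matches the dd timings shown above.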

    And yes, when dd is doing a system call, work done in the kernel is
    accounted as work done by dd. That includes many operations
    done by ext3 (some work is done by kernel threads and some is
    done from interrupts and accounted to whatever process is
    running at the given time).

    Coming back to dd CPU usage: as long as there is enough space
    to buffer the write, dd should have 100% CPU utilization. Simply,
    dd is copying data to kernel buffers as fast as it can. Once
    the kernel buffers are full dd should block -- however, what you
    wrote suggests that you have enough memory to buffer the whole
    write. Using large blocks dd should be faster than the disks,
    but for small blocks the cost of system calls may be high
    (and it does not help that you have many cores, because
    dd is single-threaded and much of the kernel work is done
    in the same thread).
    Waldek Hebisch, Apr 24, 2009
  6. calypso Guest

    Well, a coin has two sides, right? I've understood it the way I described
    it... You had pretty good arguments, and made me learn something more (be
    sure that I'll save this article somewhere)... :)

    TNX... ;)
    So, you say that it doesn't matter how many drives are in the array (RAID5
    or RAID6)? If so, that's nice, but I would like to know exactly why... Will
    read your post again...
    So basically, if I say that the stripe segment is 64kb, it means that when I
    write 512kb of data and have 12 drives, I simply use 8 drives at a time, and
    the remaining 4 drives are not used (forget about parity drives now)?
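    The arithmetic in that question can be written out explicitly. A minimal sketch, using the numbers from the question itself (64 KiB segment, 12 drives, parity ignored as asked):

```python
# With a 64 KiB stripe segment and 12 drives, a single 512 KiB sequential
# write spans 512/64 = 8 segments, i.e. 8 drives; the other 4 data drives
# sit idle for that write (parity drives ignored, as in the question).

SEGMENT = 64 * 1024
DRIVES = 12

def drives_touched(write_bytes, segment=SEGMENT, drives=DRIVES):
    segments = -(-write_bytes // segment)   # ceiling division
    return min(segments, drives)

print(drives_touched(512 * 1024))            # -> 8 drives busy
print(DRIVES - drives_touched(512 * 1024))   # -> 4 drives idle
```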

    What happens if I try to write 1kb of data in a 64kb stripe segment using 4kb
    blocks in NTFS (let's do this as an example)?
    Well, it seems that I have found someone who truly understands how RAID works,
    so I won't hesitate to ask what is still unknown to me... But first I had
    to make you angry... ;)
    Basically, what I understood is that IBM invented RAID, but Patterson and
    his crew gave it a name when they used inexpensive drives (IBM's storage
    surely cost a lot at that time?)...

    calypso, Apr 25, 2009
  7. calypso Guest

    So if you align the cache page size with the stripe size, can you benefit
    from it or not? Let's say that you've got a 16kB cache page size and have 8
    drives with a 2kb stripe segment size... If you dump the cache, you
    basically write to all drives at once, right? But this situation can slow
    everything down, since you've got how many IOPS per one write operation (>8)?
    Cool optimization...
    Thinking..... So, you need to optimize the cache of a RAID controller to
    gather changed data so that it can be written in one dump (utilizing all
    actuators at once)?
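    The numbers in that example work out as follows. A quick sketch (16 KiB page, 2 KiB segment, 8 drives, straight from the question; parity traffic not counted):

```python
# One 16 KiB cache page spread over 2 KiB stripe segments on 8 drives:
# flushing a single page touches 16/2 = 8 segments, i.e. one write per
# drive, so one logical flush costs 8 device writes (more with parity).

PAGE = 16 * 1024
SEGMENT = 2 * 1024
DRIVES = 8

segments_per_page = PAGE // SEGMENT            # 8 segments per page
writes_per_flush = min(segments_per_page, DRIVES)

print(segments_per_page, writes_per_flush)     # -> 8 8
```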
    Cool... Anyway, sorry... It's almost impossible to find this kind of
    information on the internet... I mostly work with EMC storage arrays
    (Symmetrix and CLARiiON), and go into as much detail as I can, but even
    with access to the information I don't get this deep... This is mostly
    information for firmware programming...

    Thanks a lot for explanations...

    calypso, Apr 26, 2009
  8. calypso Guest

    What do you think about using RAID3 for multimedia broadcasting, and what
    about multimedia recording?
    OK, much of the concurrency is solved using TCQ/NCQ, which means you've got
    only one IO operation for fetching a few data segments per drive...
    I see... So, is it possible that 5 1TB/7.2k drives in RAID5 with a
    huge segment size can work faster than 15 146GB/15k drives in RAID5 with a
    small segment size?

    calypso, Apr 26, 2009
  9. Guy Dawson Guest

    What's your block size for dd? I'm guessing it's the default 512 bytes
    from your figures above, so you're doing lots of little writes.

    What happens with a much bigger block size, such as 1MB or more?

    -- --------------------------------------------------------------------
    Guy Dawson I.T. Manager Crossflight Ltd
    Guy Dawson, Apr 27, 2009
  10. Guy Dawson Guest

    The key line in that link is

    dd bs=1m for a 10GB file.

    Note the 1MB block size setting for his test.

    Waldek Hebisch's post makes the same point about block size too.

    Guy Dawson, Apr 27, 2009
  11. Try using /dev/md/X directly as the target for dd, to keep filesystem
    overhead out of your measurement. (Please note that your filesystem will be
    destroyed by this.)
    Patrick Rother, Apr 27, 2009
  12. kkkk Guest

    Hi everybody,

    Thanks for your suggestions

    I have seen the suggestions by Guy and Patrick to raise the bs for dd. I
    had already tried various values for this, up to a very large value, and
    I even tried the exact bs value that would fill one complete RAID
    stripe in one write: no measurable performance improvement.

    Regarding mounting the partition with data=writeback, I will try this
    one ASAP (possibly tomorrow: I need to find a moment when nobody is
    using the machine).

    Regarding trying to dd directly to the raw block device, I will also try
    this one ASAP. Luckily I have an unused LVM device located on the same
    MD raid 6.

    stay tuned... check back in 1-2 days.

    Thanks everybody for your help
    kkkk, Apr 28, 2009
  13. kkkk Guest

    Hi all,
    based on your suggestions I have been testing lots of stuff: xfs, ext2,
    raw lvm device reads and writes, the effect of bs in dd, disk schedulers,
    noatime... Interesting stuff is coming out, and not all of it good :-(
    I will post details tomorrow.
    Thank you.
    kkkk, Apr 30, 2009
  14. kkkk Guest

    Hi all
    here are the details. Actually, most of the news is not good, but I appear
    to have found one bottleneck: LVM, see below.

    I confirm the numbers given previously, plus the following (these
    benchmarks are measured with all caches on, both for disks on the
    mobo controllers and for disks under the 3ware):

    - Ext2 is not measurably faster than ext3 for this sequential write
    - Xfs is faster on first write (147MB/sec vs 111MB/sec) but about the same
    speed on file rewrite (~185MB/sec)
    - Writes directly to the raw LVM device located on the same MD raid-6
    are AS FAST AS FILE REWRITES at ~183MB/sec!! So this maximum speed is
    not an overhead of the filesystem. During the direct LVM write, the
    md1_raid5 process runs at 35%-50% CPU occupation, the pdflush process
    runs at 40-80% CPU occupation, dd itself runs at ~80% CPU occupation,
    plus all cores are about 1/3 busy in kernel code (which gets accounted
    to these 3 and 5 more running processes), and I guess this means they
    are servicing disk interrupts.

    Mounting with noatime for ext3 does not improve performance for this
    sequential write (quite reasonable). The default was relatime anyway, which
    is quite optimized.

    Regarding the bs parameter in dd: for the first benchmarks I posted one
    week ago, I noticed it didn't make any difference whether it was not set
    (default=512), set to the stripe size (160KB), or set very high (I
    usually used 5120000). That's the reason why I didn't mention it. I
    supposed the elevator and/or page cache was compensating for the small
    value of bs.
    However, in more recent tests, it did make a difference SOMETIMES. And
    this "sometimes" is the strangest thing I encountered in my tests:
    sometimes with bs=512 write performance sucked really badly, such as being
    35-50MB/sec; this happened on file overwrite. In these cases I could
    confirm with iotop that dd's speed was very variable, sometimes being as
    low as 3MB/sec, with brief spikes at 380MB/sec, averaging a total of
    35-50MB/sec at the end of the 14GB write. I tried this test many times
    while changing the scheduler for all disks, trying both deadline and
    anticipatory, and the speed was consistently this low. Htop showed dd at
    5% CPU, pdflush usually at 0% with brief spikes at 70%, and md1_raid5 at
    about 0.6%.
    After this I tried using bs=5120000 for the file rewrite, and in this
    case the write speed was back to normal at ~185MB/sec. After this I
    tried again with bs=512, rewriting the same file, and the speed was STILL
    HIGH at ~185MB/sec!! At that point I could not reproduce the slow speed
    anymore, whatever the bs. Something got unstuck in the kernel. This looks
    like a bug somewhere in Linux to me.
    Later in my tests this slowness happened again, and this time it was on
    the raw LVM device write! Exactly something I would have never expected
    to be slow. Writing to the lvm device was even slower than file rewrite:
    speed was down to 13MB/sec. When I used bs=5120000 the write speed to
    the raw device was high at ~185MB/sec. In this case, however, the "bug"
    was not resolved by writing with a high bs once: every time I wrote to the
    lvm device with bs=512 it was unbelievably slow at 12-13MB/sec, and
    every time I wrote with a high bs, speed was normal at ~185MB/sec. I
    alternated the two types of write a few times, and the problem was
    always reproducible and also independent of the disk scheduler. It is
    still reproducible today: writing to the LVM device with bs=512 is
    unbelievably slow at 12MB/sec. Still looks like a bug to me... Ok, maybe
    not; actually, writing to the raw MD device (4-disk raid-5, see
    below) with 512-byte writes causes the same performance problem. Maybe
    it's the read-modify-write (write hole) overhead. The MD code is
    probably not capable of using the page cache for caching the stripes
    then...? (See the question at the end of this post.) Or maybe it is, but
    it's not capable of putting into the page cache the data just read due to
    a write hole, so the next 512-byte write again causes a write hole on
    the same stripe?

    Now some bad read speeds.
    Oh man, I had not noticed that the read speed was so slow on this computer.
    Read speed is around 50MB/sec with bs=512, and around 75MB/sec with
    bs=5120000. I found no way to make it faster. I tried ext3 and xfs, I
    tried AS and Deadline... no way. Reads from the raw LVM device are about
    80MB/sec (no bs) to 90MB/sec (high bs). I am positive that the array is
    not degraded. I am checking now the read speed for any single physical
    disk: it is 95MB/sec. Reading from the md1 raid-6 device (hence not
    passing through LVM) is 285MB/sec with bs=512 and 320MB/sec with
    bs=5120000!!! Heck, it is the LVM layer that slows everything down so
    badly then!! I also have a raid-5 MD with 4 disks: reading from that one
    gives 200MB/sec with bs=512 or 220MB/sec with bs=5120000.
    I am now retrying reading from the LVM device... yes, I confirm, it's
    bad. Hmm, I shouldn't have used LVM then!! Instead of making 5 logical
    volumes on LVM I should have made 5 partitions on each of the 12 disks
    and then made 5 separate MD raid-6 devices over those. I think I will
    investigate this further. Ok, ok... reading on the Internet, it seems I
    have not aligned the beginning of the LVM devices to the MD stripes,
    hence the performance degradation.
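    The suspected misalignment boils down to a simple modulo check. A sketch with hypothetical numbers (the offsets and a 10-data-disk, 64 KiB chunk geometry are assumptions for illustration, not taken from this array):

```python
# If the LVM data area does not start on an MD stripe boundary, every
# "full stripe" write issued through the LV straddles two stripes and
# triggers read-modify-write on both. Numbers below are hypothetical:
# 10 data disks x 64 KiB chunk = 640 KiB full stripe.

CHUNK = 64 * 1024
DATA_DISKS = 10
STRIPE = CHUNK * DATA_DISKS

def is_stripe_aligned(lv_start_offset_bytes, stripe=STRIPE):
    return lv_start_offset_bytes % stripe == 0

print(is_stripe_aligned(192 * 1024))  # hypothetical unaligned start -> False
print(is_stripe_aligned(STRIPE * 4))  # start on a stripe boundary -> True
```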

    By comparison, here is the write speed on the MD device: on the 4-disk
    raid-5 I can write at 103MB/sec (with all controller caches enabled and
    bs=5120000, or it would be much slower), so this is much slower than the
    read speed from the same device, which is 220MB/sec as I mentioned. I
    would really like to check the sustained write speed for 1 drive as well,
    but I cannot do that now. Also, unfortunately, I cannot check the write
    speed on the raw 12-disk raid-6 MD device because it's full of data. The
    raid-5, instead, was still empty.

    Regarding the disk scheduler: AS vs Deadline does not make a significant
    difference (consider that the machine is not doing any other significant
    I/O). NOOP is the fastest of the three if bs is high, being about 10%
    faster than the other two schedulers; however, noop is very influenced by
    bs: if bs is low (such as 512) performance usually suffers, so AS is
    probably better. I have not checked CFQ, but I remember a few months ago
    I was not impressed by CFQ's speed (which was like half of normal, on
    concurrent access, on this ubuntu kernel 2.6.24-22-openvz); also, CFQ is
    not recommended for RAID.

    I have one question for you: in case one makes a very small write, such
    as 1 byte, to an MD raid 5-6 array, and then issues a sync to force the
    write, do you think Linux MD raid 5-6 can use the page cache for getting
    the rest of the stripe (if present), so as to skip the need to perform
    all the reads (for the raid "write hole"), and so be able to directly
    perform just the writes? In other terms, do you think the MD code is in
    a position to fetch data on demand from the page cache?
    And after reading a stripe due to the first write hole, is it capable of
    putting it into the page cache so that a further write on the same
    stripe would not cause any more reads? (I suspect not, considering my
    tests in the dd/LVM paragraph above.)
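    For reference, the read-modify-write arithmetic behind that question can be sketched as follows. These are the classic textbook I/O counts for a sub-stripe write with no cached stripe data, not something verified against the MD code:

```python
# A small (sub-stripe) write to RAID5/6 with nothing cached needs a
# read-modify-write cycle: read the old data block and old parity
# block(s), then write the new data block and new parity block(s).

def small_write_ios(parity_drives):
    reads = 1 + parity_drives    # old data + old parity block(s)
    writes = 1 + parity_drives   # new data + new parity block(s)
    return reads + writes

print(small_write_ios(1))  # RAID5: 4 I/Os for a tiny write
print(small_write_ios(2))  # RAID6: 6 I/Os for a tiny write
# A full-stripe write needs no reads at all: parity is computed
# entirely from the data being written.
```

    If the rest of the stripe were available from a cache, the reads would drop out and only the data and parity writes would remain, which is exactly what the question above is asking about.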

    Thanks for your suggestions
    kkkk, May 1, 2009
