Writing to a block device is *slower* than writing to the filesystem!?

Discussion in 'Storage' started by kkkk, Aug 7, 2009.

  1. kkkk

    kkkk Guest

    Hi all,
    we have a new machine with 3ware 9650SE controllers and I am testing
    hardware RAID vs. Linux software MD RAID performance.
    For now I am on hardware RAID; I have set up a RAID-0 with 14 drives.

    If I create an xfs filesystem on it (whole device, no partitioning,
    stripe-aligned at mkfs time, etc.) and then write to a file with dd (or
    with bonnie++) like this:

    sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero \
        of=/mnt/tmp/ddtry bs=1M count=6000 conv=fsync ; time sync

    about 540 MB/sec comes out (the final sync takes 0 seconds). This is close
    to the 3ware-declared performance of 561 MB/sec:
    http://www.3ware.com/KB/Article.aspx?id=15300
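
    (For reference, a stripe-aligned mkfs looks roughly like this; the su/sw
    values assume the 256K chunk size and 14-disk RAID-0 described later in
    the thread, so they are illustrative rather than the exact invocation:)

    # su = hardware chunk size, sw = number of data disks in the stripe
    mkfs.xfs -d su=256k,sw=14 /dev/sdc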

    However, if I instead write directly to the block device, like this:

    sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/dev/sdc \
        bs=1M count=6000 conv=fsync ; time sync

    performance is 260 MB/sec!?!? (again the final sync takes 0 seconds)

    I tried many times and this is the absolute fastest I could obtain. I
    tweaked the bs and the count, I removed the conv=fsync... I made sure the
    3ware caches are ON for the block device, I set the anticipatory
    scheduler... No way: I am positive that creating the xfs filesystem and
    writing to it is definitely faster than writing to the block device
    directly.
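
    (Scheduler selection can be checked and switched through sysfs, e.g.
    something like the following on this kernel:)

    # show available schedulers (current one in brackets), then switch
    cat /sys/block/sdc/queue/scheduler
    echo anticipatory > /sys/block/sdc/queue/scheduler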

    How could that be!? Does anyone know what's happening?

    Please note that the machine is completely clean and there is no other
    workload. I am running kernel 2.6.31 (Ubuntu 9.10 alpha, live).

    Thank you
     
    kkkk, Aug 7, 2009
    #1

  2. kkkk

    kkkk Guest

    Nope, it's not that. I sought to the end of the device as you suggested,
    and the speed is not significantly different: writing to the device goes
    from 239 MB/sec at the start to 233 MB/sec at the end (so it is actually a
    bit faster at the beginning).

    I am positive that the seek value I used for dd is correct, because when I
    tried to raise it a bit further dd gave me an error: dd: `/dev/sdc':
    cannot seek: Invalid argument
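
    (For reference, an end-of-device run of this kind looks roughly like the
    following; blockdev --getsize64 reports the device size in bytes, and the
    seek value places the 6000 MiB write at the tail of the device:)

    SZ_MIB=$(( $(blockdev --getsize64 /dev/sdc) / 1048576 ))
    sync ; echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/zero of=/dev/sdc bs=1M seek=$(( SZ_MIB - 6000 )) count=6000 conv=fsync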

    Next idea...?

    Thank you!
     
    kkkk, Aug 8, 2009
    #2

  3. kkkk

    kkkk Guest

    I found it! I found it!

    dd apparently does not buffer writes correctly (good catch, Mark): it
    apparently disregards the bs value and submits very small writes. It needs
    oflag=direct to really submit bs-sized writes, and even then there's a
    limit. Also, the elevator's merging of the small writes does not try hard
    enough and cannot achieve good throughput. More details tomorrow.
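
    (In other words, the direct-I/O variant of the test looks something like
    this; same 6000 MiB write as before, with bs being whatever is under test:)

    sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/dev/sdc \
        bs=1M count=6000 oflag=direct conv=fsync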
     
    kkkk, Aug 10, 2009
    #3
  4. :> Hi all,
    :> we have a new machine with 3ware 9650SE controllers and I am testing ...
    :
    :I found it! I found it!
    :
    :dd apparently does not buffer writes correctly (good catch, Mark): it
    :apparently disregards the bs value and submits very small writes. It needs
    :oflag=direct to really submit bs-sized writes, and even then there's a
    :limit. Also, the elevator's merging of the small writes does not try hard
    :enough and cannot achieve good throughput. More details tomorrow.

    Curious. I'm not seeing that behavior in either CentOS 5 or Fedora 11
    (coreutils-5.97-19.el5, coreutils-7.2-2.fc11). In both of those, when I
    run:

    strace dd if=/dev/zero bs=1M count=1 of=somefile conv=fsync

    I see exactly one read and one write, each of size 1048576 bytes.
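
    (The corresponding check against the raw device with direct I/O would be
    something along these lines, with /dev/sdX standing in for the test
    device and only the write calls traced:)

    strace -e trace=write dd if=/dev/zero of=/dev/sdX bs=1M count=16 oflag=direct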
     
    Robert Nichols, Aug 10, 2009
    #4
  5. kkkk

    kkkk Guest


    I haven't straced it, but this is what iostat -x 1 shows (grabbed from
    live iostat):

    Without direct (bs=1M):

    Device:   rrqm/s     wrqm/s   r/s      w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
    sdc         0.00  559294.00  0.00 14384.00     0.00  570550.00     39.67    143.98    9.96   0.07 100.00

    With direct (bs=1M):

    Device:   rrqm/s     wrqm/s   r/s      w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
    sdc         0.00       0.00  0.00  3478.00     0.00  890368.00    256.00      5.77    1.66   0.28  98.40

    You see, without direct there are a whole lot of wrqm/s (probably lots of
    wasted CPU cycles spent merging), and even so the average request size
    (avgrq-sz) is still only 39.67 sectors, far below 256 (I suppose avgrq-sz
    is measured after the merges, correct?).

    With direct there are no wrqm/s at all, and the submitted request size is
    exactly 256 sectors.
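
    (avgrq-sz is reported in 512-byte sectors, so a quick sanity check on the
    two runs above, using plain shell arithmetic on the iostat numbers:)

    echo "buffered: $(( 570550 / 14384 )) sectors per request (~20 KiB)"
    echo "direct:   $(( 890368 / 3478 )) sectors per request (128 KiB)"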


    With oflag=direct, performance increases with increasing bs, like this (a
    sketch of a sweep loop follows the list):

    (3ware 9650SE-16ML hardware RAID-0, 256K chunk size, 14 disks [1TB 7200RPM SATA])
    bs size -> speed:
    512B -> 4.9 MB/sec
    1K -> 13.3 MB/sec
    2K -> 26.6 MB/sec
    4K -> 54.1 MB/sec
    8K -> 96 MB/sec
    16K -> 157 MB/sec
    32K -> 231 MB/sec
    64K -> 300 MB/sec
    128K -> 359 MB/sec (from this point on avgrq-sz does not increase any
    further, but performance still increases)
    256K -> 404 MB/sec
    512K -> 430 MB/sec
    1M -> 456 MB/sec
    2M -> 466 MB/sec
    4M -> 473 MB/sec
    3584K (the full stripe width, 14 x 256K) -> 494 MB/sec
    8M -> 542 MB/sec !! A big performance jump!!
    16M -> 543 MB/sec
    32M -> 568 MB/sec ! Another big performance jump
    64M -> 603 MB/sec ! Again!! CPU usage for this run: real 0m11.213s,
    user 0m0.004s, sys 0m3.880s
    128M -> 641 MB/sec
    256M -> 676 MB/sec
    512M -> 645 MB/sec (performance starts dropping)
    1G -> 620 MB/sec
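
    (A sweep like the one above could be scripted roughly as follows; the
    block sizes shown, the ~6 GiB total, and /dev/sdc are just the values used
    in this thread:)

    # hold the total amount written constant and vary only the block size
    TOTAL=$((6 * 1024 * 1024 * 1024))
    for bs in 4096 65536 1048576 8388608 67108864 268435456; do
        sync ; echo 3 > /proc/sys/vm/drop_caches
        dd if=/dev/zero of=/dev/sdc bs=$bs count=$((TOTAL / bs)) oflag=direct conv=fsync
    done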

    Avgrq-sz apparently cannot go over 256 sectors; is this a hardware limit
    imposed by the device / the 3ware driver?
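
    (If it is, I suppose it should show up in the block queue limits in sysfs;
    something like this would tell, assuming the attributes are exposed on
    this kernel:)

    cat /sys/block/sdc/queue/max_sectors_kb     # per-request cap currently used by the block layer, in KiB
    cat /sys/block/sdc/queue/max_hw_sectors_kb  # hard limit reported by the driver/controller, in KiB

    (256 sectors is 128 KiB, so a reading of 128 here would point at the
    driver or controller rather than at dd.)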

    Notwithstanding this, performance still increases up to bs=256M. From
    iostat the only apparent change (apart from the increasing wsec/s,
    obviously) is avgqu-sz, which stays below 1.0 up to bs=128K and then rises
    to about 20.0 at bs=256M. Do you think this could be the reason for the
    performance increase up to 256M?

    Thanks for any thoughts.
     
    kkkk, Aug 10, 2009
    #5
