Out-of-order writing by disk drives

Discussion in 'Storage' started by Anton Ertl, Apr 7, 2009.

  1. Anton Ertl

    Anton Ertl Guest

    I have released a new version of hdtest, a program that tests whether
    hard disks write out-of-order relative to the order in which the
    writes were passed to them by the OS. You can find the program at

    http://www.complang.tuwien.ac.at/anton/hdtest/

    Here I mainly present the results from my tests, and explain enough
    about the program so you know what I am talking about.


    HOW DOES IT WORK?

    It writes the blocks in an order like this:

    1000-0-1001-0-1002-0-...

    This sequence seems to inspire PATA and SATA disks to write
    out-of-order (in the order 1000-1001-1002-...-0). So you turn off the
    drive's power while running the program. The written blocks contain
    certain data that another program from the suite can check after you
    power the drive up again.
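
    For concreteness, here is a minimal C sketch of such a write pattern
    (this is not the actual hdtest source; the 1KB block size, the
    starting block 1000, and the sequence-number tagging are assumptions
    for the illustration, and it ignores the question of keeping the OS
    itself from caching or reordering the writes):

      /* Sketch of the write pattern 1000-0-1001-0-1002-0-...; runs until
         the power is cut or a write fails.  WARNING: destroys the data
         on the given device. */
      #include <fcntl.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      #define BLOCKSIZE 1024

      static void write_block(int fd, off_t blockno, uint64_t seq)
      {
        char buf[BLOCKSIZE];
        memset(buf, 0, sizeof buf);
        memcpy(buf, &seq, sizeof seq);  /* tag the block with a sequence number */
        if (pwrite(fd, buf, sizeof buf, blockno * (off_t)BLOCKSIZE)
            != (ssize_t)sizeof buf) {
          perror("pwrite");
          exit(1);
        }
      }

      int main(int argc, char **argv)
      {
        uint64_t seq = 0;
        int fd;
        if (argc != 2) {
          fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
          return 1;
        }
        fd = open(argv[1], O_WRONLY);
        if (fd < 0) {
          perror("open");
          return 1;
        }
        for (off_t b = 1000; ; b++) {   /* 1000, 0, 1001, 0, 1002, 0, ... */
          write_block(fd, b, seq++);
          write_block(fd, 0, seq++);
        }
      }

    After power-cycling the drive, a checker could compare the sequence
    number found in block 0 with the highest sequence number found in the
    blocks from 1000 upwards; a large gap means the drive kept writing
    the consecutive blocks after the last block 0 that reached the
    platters.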


    RESULTS

    I performed two sets of tests, one in November 1999 and one in April
    2009. The results have not changed much: in both sets of tests the
    disks wrote data seriously out-of-order in their default
    configuration; they can delay the writing of block 0 in this test for
    quite a long time.

    In more detail:

    In 2009 I tested three drives (and accessed the whole drive) under
    Linux 2.6.18 on Debian Etch; the USB enclosure used was a Tsunami
    Elegant 3.5" Enclosure that has PATA and SATA disk drive interfaces.

    * Maxtor L300R0 PATA (300GB), connected through a USB enclosure: in
    two tests it kept writing the consecutive blocks for 47 and 34
    blocks, respectively, after the last written block 0.

    * Seagate ST3400620A PATA (Barracuda 7200.10, 400GB):
      connected through a USB enclosure:
        3 times the result was as if it had written the blocks in-order
        1 time it wrote 3064 blocks out-of-order
        2 times it wrote 18384 blocks out-of-order
      connected directly via a PATA cable:
        1 time it wrote 1972 blocks out-of-order

    * Seagate ST3400620AS SATA (Barracuda 7200.10, 400GB), connected
      through a USB enclosure:
        1 time the result was as if it had written the blocks in-order
        2 times it wrote 3064 blocks out-of-order
        1 time it wrote 6128 blocks out-of-order
        1 time it wrote 12256 blocks out-of-order
        1 time it did not write block 0 at all

    It is interesting that the number of blocks found to be out-of-order
    is often a multiple of 3064. Maybe 3064 blocks is a multiple of the
    track size; no other explanation comes to mind.

    In 1999 I tested two drives (and accessed one partition) under
    Linux-2.2.1 on RedHat 5.1. The two drives were a Quantum Fireball
    CR8.4A (8GB) and an IBM-DHEA-36480 (6GB), both connected directly via
    PATA. I did one test with each of the disks, and they did not even
    write block 0 once on the platters before I turned off the power.

    I also tested the Quantum with write caching disabled (hdparm -W 0).
    The test run was now quite noisy (presumably from all the seeking),
    and it produced the in-order result.


    CONCLUSION

    Applications and file systems requiring in-order writes (i.e.,
    basically all of them) should use barriers or turn off write caching
    for the disk drive(s) they use. Unfortunately, the Linux ext3 file
    system does not use barriers by default; use the mount option
    barrier=1 to enable them, e.g. by putting a line like this in
    /etc/fstab:

    /dev/md2 /home ext3 defaults,barrier=1 1 2
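
    On the application side, the usual ordering point is an fsync() or
    fdatasync() between dependent writes; note that this only helps
    against drive-level reordering if the file system turns it into a
    barrier/cache flush (or write caching is off). A minimal sketch, with
    a made-up file name and record contents:

      /* Make record A durable before issuing the dependent update B.
         Whether fsync() really reaches the platters depends on barriers
         being enabled or the drive's write cache being off. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      int main(void)
      {
        const char a[] = "record A\n", b[] = "dependent update B\n";
        int fd = open("ordered.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, a, sizeof a - 1) != (ssize_t)(sizeof a - 1)) {
          perror("write"); return 1;
        }
        if (fsync(fd) != 0) { perror("fsync"); return 1; }  /* ordering point */
        if (write(fd, b, sizeof b - 1) != (ssize_t)(sizeof b - 1)) {
          perror("write"); return 1;
        }
        if (fsync(fd) != 0) { perror("fsync"); return 1; }
        return close(fd);
      }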

    Followups set to comp.arch.storage

    - anton
     
    Anton Ertl, Apr 7, 2009
    #1

  2. Anton Ertl

    Anton Ertl Guest

    Yes. But some people seem to imagine that this is a very small effect
    that can be ignored without ill effects on the consistency of the
    on-disk data of a file system; this attitude is exemplified by
    barriers being disabled by default in the ext3 file system in Linux.

    The test demonstrates that the reordering can happen over several
    seconds.

    That's a very good explanation. Given that the program ran
    significantly slower (about 6MB/s transfer rate) than what the drive
    is capable of (>70MB/s), it's not surprising that in most tests the
    power was turned off between such batches, with only one cut
    happening during a batch.

    Hmm, I could test my track-size theory by working on another area of
    the drive (but I am probably too lazy to do that; your theory sounds
    better anyway :-)). If it's really a multiple of the track size, the
    number should change, because the track size varies across the disk.

    BTW, I used blocks of 1KB, so it's 6128 sectors.

    - anton
     
    Anton Ertl, Apr 8, 2009
    #2

  3. Anton Ertl

    Anton Ertl Guest

    Yes, nowadays you can have them without turning off write caching
    completely, so it's entirely reasonable.

    There are file systems like ext3 with data=ordered or data=journal,
    or BSD FFS with soft updates, that do give guarantees about ordering.
    But in order to implement these guarantees they must take explicit
    steps, and ext3 does not do that by default.

    At 70MB/s and 7200rpm = 120 revolutions/s, the track size is at least
    70(MB/s)/120(/s) = 0.583MB. It is probably a little larger, because
    aligning the head for the next platter or moving it to the next
    cylinder also costs a little time on each revolution.

    My guess (inspired by you) is that it destaged 3064KB at a time. The
    slow transfer rate is probably a result of doing synchronous writes
    to the disk buffers; a write would only report completion when the
    data had arrived in the disk's buffers, and only then would the next
    write start and weave its way through the various subsystems.

    Yes, if it waited about a second for the 3064KB to accumulate (the
    other 3MB/s are spent writing block 0 repeatedly) and then needed
    about 40ms to write them to the platters, there would have been ample
    time to write block 0 between each batch (see the rough numbers
    below). My guess is that it tries to write the blocks roughly in
    order of age, and block 0 is rarely the oldest one it sees because it
    gets overwritten by younger instances all the time.

    No. How would you test that? But given the results of this test, it
    seems most plausible to me that the ST3400620A(S) destages 3064KB at
    a time if it gets that much sequential data.
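
    To put rough numbers on the timing argument above, here is a tiny C
    sketch that just reproduces the arithmetic with the figures from this
    thread (the ~3MB/s arrival rate of new sequential data, the 3064KB
    batch, and the ~70MB/s platter rate are the estimates discussed here,
    not measurements):

      /* Back-of-the-envelope numbers for the destaging guess, using the
         estimates from this thread. */
      #include <stdio.h>

      int main(void)
      {
        double arrival_rate = 3.0e6;    /* bytes/s of new non-block-0 data */
        double platter_rate = 70.0e6;   /* bytes/s sustained write rate */
        double batch = 3064.0 * 1024;   /* bytes per assumed destage batch */

        printf("time to accumulate a batch: %.2f s\n", batch / arrival_rate);
        printf("time to destage a batch:    %.0f ms\n",
               1000 * batch / platter_rate);
        /* ~1 s to accumulate vs. ~45 ms to destage: plenty of idle time
           in which block 0 could have been written between batches. */
        return 0;
      }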

    - anton
     
    Anton Ertl, Apr 8, 2009
    #3
  4. Maxim S. Shatskih

    Maxim S. Shatskih Guest

    Yes, NTFS is careless about ordering on anything but the logfile, and
    logfile updates are done using the FUA bit.

    I don't know offhand how the FUA bit is interpreted by the Windows
    (S)ATA stack, but, from what I remember, the ATA spec had no analog
    at all until rather recently.

    Probably the Windows (S)ATA stack flushes the whole in-drive cache
    before completing a FUA request, but this is just a guess and may be
    wrong.
     
    Maxim S. Shatskih, Apr 9, 2009
    #4
  5. Anton Ertl

    Anton Ertl Guest

    That "option" is the default for ext3. Concerning other Unix file
    systems, they at least try to preserve metadata consistency on
    crashes, and to do that, they need guarantees about the order of
    writes. Journaling file systems need guarantees about the order of
    journal writes, as well as about the order of journal writes relative
    to the writes the journal entries describe.

    And even the bad old BSD FFS tried to perform synchronous metadata
    writes in order to preserve metadata consistency, and required these
    writes to happen in order so that fsck could recover the metadata; if
    the writes occurred in order, only one block could be wrong, and fsck
    relied on that.

    Write-back delays are one thing, out-of-order writes are a very
    different thing. Delaying the writes means that one loses a few
    seconds' worth of changes on a crash (which may or may not be
    acceptable); out-of-order writes can destroy the consistency of the
    file system.
    IMO running any file system that contains data worth preserving on a
    drive in a mode that allows reordering to happen beyond the control
    of the file system is irresponsible on the part of the sysadmin; and
    making such behaviour the default is irresponsible on the part of the
    file system developer. I.e., if the drive offers support for queuing
    or tagged commands, the file system should use them by default, and
    if it doesn't, the file system should turn off write caching on the
    drive by default.

    IMO the delays come from latencies in the communication between the
    user process, various kernel components, the host adapter, and the
    disk controller, because all of that goes on synchronously (the only
    asynchronous part there is the writing of the data to the platters);
    therefore I don't expect the maximum disk write rate to have much
    influence. I expect that if I double the block size, the transfer
    rate doubles until I approach the maximum transfer rate (a rough
    measurement sketch follows below).

    The non-block-0 writes start in the middle of the device.

    That's certainly an interesting test. Maybe next time (another ten
    years?).
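
    Here is a rough sketch for the block-size prediction above: time a
    batch of synchronous writes of a given size and print the transfer
    rate. The use of O_SYNC as the synchronous mechanism and the
    command-line interface are assumptions for the illustration, not how
    hdtest works:

      /* Time n synchronous writes of the given block size and report MB/s. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <time.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
        size_t blocksize = (argc > 2) ? (size_t)atol(argv[2]) : 1024;
        int n = 1000, i, fd;
        char *buf = malloc(blocksize);
        struct timespec t0, t1;
        double secs;

        if (argc < 2 || buf == NULL) {
          fprintf(stderr, "usage: %s file [blocksize]\n", argv[0]);
          return 1;
        }
        memset(buf, 'x', blocksize);
        fd = open(argv[1], O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < n; i++)
          if (write(fd, buf, blocksize) != (ssize_t)blocksize) {
            perror("write");
            return 1;
          }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%zu-byte blocks: %.2f MB/s\n",
               blocksize, n * blocksize / secs / 1e6);
        free(buf);
        return close(fd);
      }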

    - anton
     
    Anton Ertl, Apr 9, 2009
    #5
  6. Maxim S. Shatskih

    Maxim S. Shatskih Guest

    Usually, the update is first written to the journal (and must reach
    the hard disk media), and only then is it reflected in the actual
    metadata.

    In this case, it is enough to use FUA only on journal writes (or
    something similar emulated on ATA, for instance a drive cache flush
    after each such write).
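
    A minimal sketch of that ordering rule, with fdatasync() standing in
    for FUA or a drive cache flush on Linux (the file names and record
    format are invented for the illustration, and how reliably
    fdatasync() reaches the media depends on the barrier/write-cache
    settings discussed in this thread):

      /* Write-ahead ordering: the journal record must be durable before
         the in-place update is issued. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      static int append_all(int fd, const char *s)
      {
        size_t len = strlen(s);
        return write(fd, s, len) == (ssize_t)len ? 0 : -1;
      }

      int main(void)
      {
        int journal = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        int data    = open("metadata.dat", O_WRONLY | O_CREAT, 0644);
        if (journal < 0 || data < 0) { perror("open"); return 1; }

        /* 1. describe the intended update in the journal ... */
        if (append_all(journal, "set inode 42 size=4096\n") != 0) return 1;
        /* 2. ... make sure it has reached stable storage ... */
        if (fdatasync(journal) != 0) { perror("fdatasync"); return 1; }
        /* 3. ... and only then update the metadata in place. */
        if (append_all(data, "inode 42 size=4096\n") != 0) return 1;
        return 0;
      }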
     
    Maxim S. Shatskih, Apr 9, 2009
    #6
  7. Anton Ertl

    Anton Ertl Guest

    Yes, any feature that ensures partial ordering is sufficient. But
    using write caching without any such features is not.

    - anton
     
    Anton Ertl, Apr 10, 2009
    #7
