
    SNACK and recovery



    This turns out to be a matter not just of rarity,
    but also one of consequences.  As Mark points
    out, for tapes and similar devices, the consequences
    are disastrous - the backup aborts, and when
    "those in charge" come in the next morning,
    they have no usable backup tape, and are very
    unhappy.  While Jon says "streams devices must
    support abort and retry for extreme errors in
    any case", the abort may well be the entire
    backup and the retry might be next weekend ...
    not a good situation.
    
    Over in the Fibre Channel world, FCP-2 contains
    recovery support.  It resulted from the discovery
    that frames do get lost, even though non-delivery
    of a Fibre Channel frame (Class 2 or 3 - it
    doesn't matter which) is "extremely rare" for
    the following reasons:
    - Buffer overrun is prevented by both link
    	and end-to-end buffer usage controls.
    - FC switches are engineered to not drop
    	frames to the maximum extent possible
    	due in part to these consequences.
    - There's a 32-bit CRC covering the entire
    	FC frame.
    Despite all of this, failure to deliver a frame
    happens often enough that a recovery mechanism is
    needed to avoid tape backup aborts and the like.
    Unlike TCP, Fibre Channel has no built-in
    retransmit mechanism.
    
    In contrast to Fibre Channel, we are dealing
    with something rarer because TCP retransmit will
    take care of most things that can go wrong in
    switches and there's a 16-bit checksum whose
    failure will trigger retransmits.  What this
    appears to come down to is:
    
    - Does a 16-bit TCP checksum catch enough of
    the corruption events to make it acceptable to
    take drastic measures like aborting a backup
    when a 32-bit CRC fails on a response that
    made it through the 16-bit checksum?
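    
    As a small illustration of the kind of corruption
    this question is about, the sketch below builds a
    payload, swaps two aligned 16-bit words in it, and
    compares a 16-bit one's-complement checksum with a
    32-bit CRC over the result.  It makes some
    assumptions: the checksum is computed over the
    payload only (the real TCP checksum also covers
    the TCP header and a pseudo-header), zlib's CRC-32
    stands in for whatever 32-bit CRC is finally
    chosen, and the helper name is made up for the
    example.

```python
import struct
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit words (illustrative helper)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return ~total & 0xFFFF


original = b"\x12\x34\x56\x78\x9a\xbc\xde\xf0"
# Corruption that swaps the first two aligned 16-bit words.
corrupted = original[2:4] + original[0:2] + original[4:]

# The sum does not depend on word order, so the checksum cannot see the swap.
same_sum = internet_checksum(original) == internet_checksum(corrupted)
# The CRC depends on bit positions, so it does see it.
same_crc = zlib.crc32(original) == zlib.crc32(corrupted)

print("16-bit checksum unchanged:", same_sum)
print("32-bit CRC unchanged:     ", same_crc)
```

    This shows only that such events exist, not how
    often they happen in practice, which is the number
    the question above really depends on.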
    
    The discussion's been a bit convoluted.  Some
    simple yes/no answers to the above question
    accompanied by short reasoning would be appreciated.
    I think Julian's said "no" and quoted a filesystem
    number that we're awaiting a reference to.
    
    Just to muddy the waters further, let me point out
    that tape targets tend to be less complex than
    disk targets.  Tapes don't reorder commands, and
    often don't even queue them.  Saving the last N
    responses is not that difficult when the responses
    go out in the order that the commands came in,
    because that makes it easier to organize saving
    them.  In addition, the initiator has to be very
    careful about the number of commands in flight to
    avoid disasters caused by dropped commands, which
    should lead to reasonable results from relatively
    small values of N.
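    
    A minimal sketch of the "save the last N responses"
    idea for an in-order target follows.  The class,
    its methods, and the use of a per-command sequence
    number as the lookup key are all assumptions made
    for illustration, and sequence number wraparound
    is ignored.

```python
from collections import deque


class ResponseHistory:
    """Keeps the last n responses of an in-order target (hypothetical helper)."""

    def __init__(self, n):
        # A bounded deque drops the oldest entry automatically when full.
        self._saved = deque(maxlen=n)

    def record(self, cmd_sn, response):
        """Save a response; calls are assumed to arrive in command order."""
        self._saved.append((cmd_sn, response))

    def replay_from(self, first_missing_sn):
        """Return saved responses from first_missing_sn on, or None if too old."""
        if not self._saved or self._saved[0][0] > first_missing_sn:
            return None  # the requested response has already been discarded
        return [resp for sn, resp in self._saved if sn >= first_missing_sn]


history = ResponseHistory(n=4)
for sn in range(1, 7):
    history.record(sn, f"status-{sn}")

print(history.replay_from(5))  # responses still held for commands 5 and 6
print(history.replay_from(2))  # None: command 2 fell out of the window
```

    Because responses leave in command order, the
    window is a plain FIFO and no per-command index is
    needed; a None result is the case where the
    initiator would have to fall back to aborting.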
    
    --David
    
    ---------------------------------------------------
    David L. Black, Senior Technologist
    EMC Corporation, 42 South St., Hopkinton, MA  01748
    +1 (508) 435-1000 x75140     FAX: +1 (508) 497-8500
    black_david@emc.com       Mobile: +1 (978) 394-7754
    ---------------------------------------------------
    
    


Last updated: Tue Sep 04 01:05:10 2001