
Re: [Gluster-devel] Client side AFR race conditions?


From: Anand Babu Periasamy
Subject: Re: [Gluster-devel] Client side AFR race conditions?
Date: Fri, 02 May 2008 17:04:45 -0700
User-agent: Mozilla-Thunderbird 2.0.0.12 (X11/20080420)

Let me explain more about this issue:

When multiple applications write to the same file, it is really the
application's responsibility to handle coherency using POSIX locks or
some form of IPC/RPC mechanism.  Even without AFR, file systems do not
guarantee the order of writes, and hence the integrity of the data.
When AFR is inserted, this corruption can additionally leave the
replicas with disparate sets of data, rather than just overwritten
data.

This shouldn't be seen as an issue with AFR.  If applications handle
coherency, AFR will work fine.  It is possible to introduce an
atomic-write option (locked writes) in AFR, but it would be of little
use: it still cannot avoid corruption when one application overwrites
another's data without holding a lock.
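
For illustration only (this is not GlusterFS code; the path and the
record are made up), a minimal sketch of what "applications handle
coherency using POSIX locks" means for two cooperating writers, using
an advisory fcntl() record lock around each write:

/* Two cooperating writers serialize their updates to a shared file
 * with a POSIX advisory lock (fcntl F_SETLKW).  Purely illustrative;
 * the path below is a hypothetical glusterfs mount point. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int locked_append(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };    /* lock the whole file */
    if (fcntl(fd, F_SETLKW, &lk) < 0) {                /* block until the lock is held */
        close(fd);
        return -1;
    }

    ssize_t rc = write(fd, buf, len);                  /* the protected write */

    lk.l_type = F_UNLCK;                               /* release the lock */
    fcntl(fd, F_SETLKW, &lk);
    close(fd);
    return rc == (ssize_t)len ? 0 : -1;
}

int main(void)
{
    const char *msg = "one coherent record\n";
    return locked_append("/mnt/glusterfs/shared.log", msg, strlen(msg)) ? 1 : 0;
}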

In summary, AFR doesn't have a race condition.
--
Anand Babu Periasamy
GPG Key ID: 0x62E15A31
Blog [http://ab.freeshell.org]
The GNU Operating System [http://www.gnu.org]
Z RESEARCH Inc [http://www.zresearch.com]



Martin Fick wrote:
--- Krishna Srinivas <address@hidden> wrote:
> > I am curious, is client side AFR susceptible
> > to race conditions on writes?  If not, how is this
> > mitigated?
>
> This is a known issue with the client side AFR.

Ah, OK.  Perhaps it is already documented somewhere, but I can't help
but think that the AFR translator deserves a page dedicated to some of
the design trade-offs made and the impact they have.  With enough
thought, it is possible to deduce/guess at some of the potential
problems, such as split brain and race conditions, but for most of us
this remains a guess until we ask on the list.  Perhaps with the help
of others I will set up a wiki page for this.  This kind of documented
info would probably help situations like the one with Garreth, where
he felt misled by the glusterfs documentation.

> We can solve this by locking, but there will be a performance hit.
> Of course, if the applications do the locking themselves, then all
> will be fine.  I feel we can have it as an option to disable the
> locking in case users are more concerned about performance.
>
> Do you have any suggestions?

I haven't given it a lot of thought, but how would the locking work?
Would you be doing something like this:

  SubA          AFR      application     SubB
    |            |            |            |
    |            |<---write---|            |
    |            |            |            |
    |<---lock----|-----------lock--------->|
    |---locked-->|<---------locked---------|
    |            |            |            |
    |<--write----|----------write--------->|
    |--written-->|<--------written---------|
    |            |            |            |
    |<--unlock---|----------unlock-------->|
    |--unlocked->|<--------unlocked--------|
    |            |            |            |
    |            |---written->|            |


because that does seem to be a rather large three-round-trip latency
versus the current single round trip, not including all the lock
contention performance hits!  This solution also has the problem of
lock recovery if a client dies.
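
In other words (a throwaway sketch; sub_lock/sub_write/sub_unlock are
stand-ins for the AFR child calls, not real GlusterFS APIs), every
application write would fan out into three sequential round trips
before the application gets "written" back:

/* Sketch of the diagram above: lock, write, unlock, each a separate
 * round trip to every subvolume.  The sub_* helpers are placeholders. */
#include <stdio.h>

static void sub_lock(const char *sub)   { printf("%s: lock\n",   sub); }
static void sub_write(const char *sub)  { printf("%s: write\n",  sub); }
static void sub_unlock(const char *sub) { printf("%s: unlock\n", sub); }

static void afr_locked_write(void)
{
    const char *subs[] = { "SubA", "SubB" };
    int i;

    for (i = 0; i < 2; i++) sub_lock(subs[i]);    /* round trip 1 */
    for (i = 0; i < 2; i++) sub_write(subs[i]);   /* round trip 2 */
    for (i = 0; i < 2; i++) sub_unlock(subs[i]);  /* round trip 3 */
    /* only here does the application see "written" */
}

int main(void)
{
    afr_locked_write();
    return 0;
}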

If, instead, a rank (which could be configurable or
random) were given to each subvolume at startup, one
alternative would be to always write to the
highest-ranking subvolume first:

   (A is a higher rank than B)

  SubA         AFR         Application        SubB
    |           |               |               |
    |           |<----write-----|               |
    |<--write---|               |               |
    |--version->|               |               |
    |           |----written--->|               |
    |           |               |               |
    |           |----------(quick)heal--------->|
    |           |<------------healed------------|

The quick heal would essentially be the same write, but
knowing/enforcing the version # returned from the SubA write.  Since
all clients would always have to write to SubA first, SubA's ordering
would be reflected on every subvolume.  While this solution leaves a
potentially larger window during which SubB is unsynced, it should
maintain the single round-trip latency from the application's
standpoint and avoid any lock contention performance hits?  If a
client dies in this scenario, any other client could always heal SubB
from SubA, so there are no lock recovery problems.
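
Roughly, as a sketch (the helper names and the per-file version
counter are made up for illustration; none of this is actual AFR
code):

/* Ranked-write sketch: the write goes to SubA (highest rank) first and
 * returns to the application; the version SubA assigns then orders the
 * same write on SubB during the "quick heal". */
#include <stdio.h>

static unsigned suba_version = 0;   /* per-file version counter kept by SubA */

/* SubA applies the write and hands back a version number. */
static unsigned sub_write_versioned(const char *data)
{
    suba_version++;
    printf("SubA: write \"%s\" -> version %u\n", data, suba_version);
    return suba_version;
}

/* SubB applies the same write later, in the order SubA's version dictates. */
static void sub_write_at_version(const char *data, unsigned version)
{
    printf("SubB: heal \"%s\" at version %u\n", data, version);
}

static void afr_ranked_write(const char *data)
{
    unsigned v = sub_write_versioned(data);  /* single round trip to SubA */
    /* "written" can be returned to the application at this point */
    sub_write_at_version(data, v);           /* quick heal of SubB, ordered by v */
}

int main(void)
{
    afr_ranked_write("first record");
    afr_ranked_write("second record");
    return 0;
}

The point is just that SubA's version number is what imposes a single
order on SubB, without adding any round trip before the application's
write returns.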


Both of these solutions could probably be greatly enhanced with a
write-ahead log translator or some form of buffering above each
subvolume; this would decrease the latency by allowing the write data
to be transferred before/while the lock/ordering info is synchronized.
But this may be rather complicated?  However, as is, they both seem
like fairly simple solutions without too much of a design change?


The non-locking approach seems a little odd at first and may be more
of a conceptual change to the current AFR method, but the more I think
about it, the more appealing it seems.  Perhaps it would not actually
even be a big coding change?  I can't help but think that this method
could also potentially be useful in eliminating more split-brain
situations, but I haven't worked that out yet.

There is a somewhat subtle reason why it makes sense that the locking
solution is slower: locking enforces serialization across all the
writes.  This serialization is not really what is needed; we only need
to ensure that the (potentially unserialized) ordering is the same on
both subvolumes.

Thoughts?

-Martin


P.S. Simple ascii diagrams generated with:
http://www.theficks.name/test/Content/pmwiki.php?n=Sdml.HomePage



      





