Re: [Gluster-devel] Improving real world performance by moving files clo

gluster-devel

From:	Derek Price
Subject:	Re: [Gluster-devel] Improving real world performance by moving files closer to their target workloads
Date:	Fri, 16 May 2008 13:26:51 -0400
User-agent:	Thunderbird 2.0.0.14 (Windows/20080421)

address@hidden wrote:

I'm assuming that versioning and locking can and should be combined. You've admitted the necessity for keeping copies of files synchronized and IO is always going to require some sort of lock to accomplish this. By having the quorum remain aware of what the most recent version of a given file is, whether that file is locked, and perhaps where copies of the file reside, you could reduce the number of nodes that must be consulted when a lock is needed.
True enough, but some care would need to be exercised to ensure that this doesn't lead to edge cases where a node thinks it still has a lock, but all the other nodes have expired it (e.g. temporary network outage).

The DLM that GFS uses must already take this into account since it appears to work just fine, and the GPL'd code for that DLM was officially added to the Linux kernel with release 2.6.19, according to Wikipedia. Not sure how portable that would be, but the source is available...

I think you will also speed things up if you don't have to consult all nodes for every IO operation. If all available nodes must be consulted, then you introduce an implicit wait until a specified timeout for every IO request if any single node is down. With the quorum model, even before fencing takes place, almost half the nodes can go incommunicado and the rest can operate as efficiently as they did with all nodes in service.
Indeed, a quorum of (n/2)+1 nodes should, in theory, suffice for safely granting a lock, but it would probably mean that the locks should be refreshed several times more often than the default lock TTL, just to account for the scope of packet loss. Releases of locks should, of course, be explicitly notified to the cluster.

Yes. Again, perhaps the DLM from Red Hat's cluster project already solves this?
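To make the (n/2)+1 rule discussed above concrete, here is a minimal sketch of majority-granted locks with a TTL. It is not how the Red Hat DLM or GlusterFS works; the names `QuorumLockTable` and `acquire`, the in-memory vote tables, and the use of `None` to stand for an unreachable node are all assumptions made for illustration.

```python
import time


class QuorumLockTable:
    """One voter's view of which client holds which lock (hypothetical)."""

    def __init__(self):
        self.locks = {}  # path -> (holder, expiry time)

    def vote(self, path, client, ttl, now):
        holder = self.locks.get(path)
        # Grant if the lock is free, has expired, or is being refreshed
        # by the client that already holds it.
        if holder is None or holder[1] <= now or holder[0] == client:
            self.locks[path] = (client, now + ttl)
            return True
        return False

    def release(self, path, client):
        # Explicit release, as opposed to waiting for the TTL to run out.
        if self.locks.get(path, (None, 0))[0] == client:
            del self.locks[path]


def acquire(voters, path, client, ttl=10.0, now=None):
    """Return True if a strict majority of ALL voters grants the lock.

    `voters` holds a QuorumLockTable for each reachable node and None for
    each unreachable one; unreachable nodes simply count as "no" votes.
    """
    now = time.monotonic() if now is None else now
    needed = len(voters) // 2 + 1  # the (n/2)+1 from the thread
    granted = sum(
        1 for v in voters
        if v is not None and v.vote(path, client, ttl, now)
    )
    if granted >= needed:
        return True
    # Roll back partial grants so a failed attempt does not block others.
    for v in voters:
        if v is not None:
            v.release(path, client)
    return False
```

With five voters of which two are unreachable, a client can still obtain the lock from the remaining three. A node cut off with only a minority cannot, which is the edge case raised earlier: once it fails to refresh, it has to assume the lock is gone.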

Just brainstorming quorum theory, but if scalability of the quorum model becomes an issue, perhaps the quorum could only be consulted to elect, discover, and verify a mini-quorum of nodes (perhaps elected partially based on the fact that they reside behind different switches). Then once a node was aware of the identity of the mini-quorum, it would only have to communicate with a single node for locking and file discovery purposes. This assumes that hosts in the mini-quorum would refuse to cooperate with a client if they could not confirm with the quorum that they were still part of the mini-quorum or couldn't contact the rest of the mini-quorum, of course, and would consult the quorum to elect a new mini-quorum when either of these was not the case.

If some HA and fault-tolerant DHT implementation exists that already handles atomic hash inserts with recognizable failures for keys that already exist, then perhaps that could take the place of DLM's quorum model, but I think any algorithm that requires contacting all nodes will prove to be a bad idea in the end.
Not all nodes - only the nodes that contain a certain file. A single ping broadcast to find out who has a copy of the file should prove to be of insignificant bandwidth overhead compared to actual file transfers, unless you are dealing with a lot of files that are significantly smaller than a network packet.

My point was that, as I understood your algorithm, a client would not know which nodes contained a certain file until all nodes had been contacted. So, while the actual bandwidth, even to consult thousands of nodes, might be small relative to file transfer bandwidth, the client can't assume it has a complete answer until it gets all the replies, meaning requests to downed nodes have timed out. Meaning that if you assume that at least one node will always be down, then the minimum time to locate a node with the most recent copy of the file (and thus the minimum time to begin any read) is always the timeout attached to waiting for the ping reply.

Having the entire quorum aware of which version of each file is the most recent and where to find the file avoids this problem, again, until just less than half the nodes become unreachable.
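The timing argument above can be put into numbers with a small sketch. The function names and the model (one reply latency per node, `None` for a node that is down) are assumptions for illustration only.

```python
def time_to_answer_all(latencies, timeout):
    """Client must hear from every node; a down node (None) costs a full timeout."""
    return max(timeout if t is None else min(t, timeout) for t in latencies)


def time_to_answer_quorum(latencies, timeout):
    """Client only needs the fastest (n/2)+1 replies."""
    needed = len(latencies) // 2 + 1
    replies = sorted(t for t in latencies if t is not None and t <= timeout)
    if len(replies) < needed:
        return None  # no quorum reachable within the timeout
    return replies[needed - 1]


# Five nodes, one of them down, with a 5 second timeout.
latencies = [0.002, 0.003, 0.004, 0.005, None]
print(time_to_answer_all(latencies, timeout=5.0))     # 5.0
print(time_to_answer_quorum(latencies, timeout=5.0))  # 0.004
```

In this model a single downed node pushes the "ask everyone" case to the full timeout, while the quorum case is bounded by the third-fastest reply.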

I might optimize the expunge algorithm slightly by having nodes with low loads volunteer to copy files that otherwise couldn't be expunged from a node. Better yet, perhaps, would be a background process that runs on lightly loaded nodes and tries to create additional redundant copies at some configurable tolerance beyond the "minimum # of copies" threshold.
Not just lightly loaded nodes, but more importantly, nodes with the most free space available. :)

Yes, the algorithm to detect "loading" should probably consider as many resource constraints as appear practical.
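One way to fold several resource constraints into a single "loading" figure is a weighted score. The sketch below is only an illustration: the `NodeStats` fields, the weights, and the linear formula are all assumptions, and nothing in this thread settles on them.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    name: str
    free_bytes: int
    total_bytes: int
    cpu_load: float  # 0.0 idle .. 1.0 saturated
    net_load: float  # 0.0 idle .. 1.0 saturated


def volunteer_score(node, file_size, weights=(0.5, 0.25, 0.25)):
    """Higher is better; None means the node cannot hold the file at all."""
    if node.free_bytes < file_size:
        return None
    w_space, w_cpu, w_net = weights
    # Fraction of the disk that would still be free after taking the copy.
    free_fraction = (node.free_bytes - file_size) / node.total_bytes
    return (w_space * free_fraction
            + w_cpu * (1.0 - node.cpu_load)
            + w_net * (1.0 - node.net_load))


def pick_volunteer(nodes, file_size):
    """Return the name of the best-scoring eligible node, or None."""
    scored = [(volunteer_score(n, file_size), n.name) for n in nodes]
    scored = [s for s in scored if s[0] is not None]
    return max(scored)[1] if scored else None
```

Free space is weighted most heavily here to reflect the point made above, but the weights would have to be tuned, or made configurable, in any real implementation.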

For file delta writes, an AFR type mechanism could be used to send the deltas to all the nodes that have the file. This could all get quite tricky, because it might require a separate multicast group to be set up for up to every node combination subset, in order to keep the network bandwidth down (or you'd just end up broadcasting to all nodes, which means things wouldn't scale as switches should, it'd be more like using hubs).
This would potentially have the problem that there are only 24 bits of IP multicast address space, but that should provide enough groups with sensible redundancy levels to cover all node combinations. This may or may not be way OTT complicated, though. There is probably a simpler and more sane solution.
I'm not sure what overhead is involved in creating multicast groups, but they would only be required for files currently locked for write, so perhaps creating and discarding the multicast groups could be done in conjunction with creation and release of write locks.
Sure, these could be dynamic, but setup and teardown might cause enough overhead that you might as well be broadcasting all the locks and writes, and just expect the affected nodes to pick those out of the air and act on them.
It's also possible that you could reduce the complexity of this problem by simply discarding as many copies down to as close to the minimum # as other nodes will allow, on write. However, I think that might reduce some of the performance benefits this design otherwise gives each node.
Also remember that the broadcasts or multicasts would only actually be useful for locks and file discovery. The actual read file transfer would be point-to-point and writes would be distributed to only the subset of nodes that are currently caching the files.
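The question of whether one multicast group per node combination fits in the address space can be checked with a little arithmetic. This sketch takes the 24-bit figure quoted above at face value and assumes every file has exactly `copies` replicas; both are simplifications, and the function names are made up for the example.

```python
from math import comb

ADDRESS_SPACE = 2 ** 24  # figure quoted in the thread, taken at face value


def groups_needed(nodes, copies):
    """Distinct node subsets of exactly `copies` members out of `nodes`."""
    return comb(nodes, copies)


def fits(nodes, copies):
    return groups_needed(nodes, copies) <= ADDRESS_SPACE


print(groups_needed(100, 3), fits(100, 3))    # 161700 True
print(groups_needed(1000, 3), fits(1000, 3))  # 166167000 False
```

Under these assumptions, static groups for every combination stop fitting somewhere between a hundred and a thousand nodes at three copies per file, which is one argument for creating groups only for files that are currently locked for write.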

Read would be point-to-point (perhaps multi-point to point for implicit read striping across all known valid copies?), but it could still be useful to use multi-cast for write, especially if the redundant copies were behind a different switch than the node accepting the write. So multi-cast setup could happen when a server obtained a write lock, and teardown would be delayed until synchronization of redundant copies had completed.

There would need to be special handling of a case where a node accepting a big write is running out of space as a consequence and something has to be dropped. Obviously, none of the currently open files can be discarded, so there would need to be some kind of an auxiliary process that would make a node request a "volunpeer" (pun intended) to take over a file that it needs to flush out, if discarding it would bring the redundancy below the required threshold.

I think this could be worked into the normal expunge algorithm with a property like: "ANY request to expunge a file that reduces the file count below the redundancy threshold will ALWAYS generate a volunpeer IF at least one node exists with the disk space available".

It wouldn't require any special casing - the needed space will always become available upon expunge if space for the migrating file exists anywhere on the network. If all the files are expunged, or they can't be even with this property of expunge, and the local disk still fills up, then I think it would be reasonable for the FS to return a disk full error.
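A rough sketch of that expunge property follows, with a write path that falls back to a disk full error. The `Node` class, the `expunge` and `write` functions, and the "most free space wins" choice of volunpeer are illustrative assumptions, not an existing GlusterFS interface.

```python
import errno


class Node:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.files = {}  # path -> size

    def free(self):
        return self.capacity - sum(self.files.values())


def expunge(node, path, cluster, min_copies):
    """Drop `path` from `node`, recruiting a volunpeer first if needed.

    Returns False if dropping the copy would break the redundancy
    threshold and no other node has room to take the file over.
    """
    size = node.files[path]
    holders = [n for n in cluster if path in n.files]
    if len(holders) <= min_copies:
        candidates = [n for n in cluster
                      if path not in n.files and n.free() >= size]
        if not candidates:
            return False
        volunpeer = max(candidates, key=Node.free)
        volunpeer.files[path] = size  # stands in for the actual copy
    del node.files[path]
    return True


def write(node, path, size, cluster, min_copies, open_files=()):
    """Make room by expunging closed files, else report a full disk."""
    for victim in list(node.files):
        if node.free() >= size:
            break
        if victim not in open_files:
            expunge(node, victim, cluster, min_copies)
    if node.free() < size:
        raise OSError(errno.ENOSPC, "No space left on device")
    node.files[path] = size
```

The write path needs no special case here: it simply calls the ordinary expunge and only raises ENOSPC once nothing more can be moved or dropped.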


Regards,

Derek
--
Derek R. Price
Solutions Architect
Ximbiot, LLC <http://ximbiot.com>
Get CVS and Subversion Support from Ximbiot!

v: +1 248.835.1260
f: +1 248.246.1176
