Re: [Gluster-devel] Improving real world performance by moving files clo

From:	Gordan Bobic
Subject:	Re: [Gluster-devel] Improving real world performance by moving files closer to their target workloads
Date:	Fri, 16 May 2008 19:28:44 +0100
User-agent:	Thunderbird 1.5.0.12 (X11/20080430)

Derek Price wrote:

The DLM that GFS uses must already take this into account since it appears to work just fine, and the GPL'd code for that DLM was officially added to the Linux kernel with release 2.6.19, according to Wikipedia. Not sure how portable that would be, but the source is available...

I'm not sure how portable that would be. A nice thing about GlusterFS is that the only requirement is FUSE, which means it'll also work on Solaris.

If some HA and fault-tolerant DHT implementation exists that already handles atomic hash inserts with recognizable failures for keys that already exist, then perhaps that could take the place of DLM's quorum model, but I think any algorithm that requires contacting all nodes will prove to be a bad idea in the end.
Not all nodes - only the nodes that contain a certain file. A single ping broadcast to find out who has a copy of the file should prove to be of insignificant bandwidth overhead compared to actual file transfers, unless you are dealing with a lot of files that are significantly smaller than a network packet.
My point was that, as I understood your algorithm, a client would not know which nodes contained a certain file until all nodes had been contacted. So, while the actual bandwidth, even to consult thousands of nodes, might be small relative to file transfer bandwidth, the client can't assume it has a complete answer until it gets all the replies, meaning requests to downed nodes have timed out.

I agree that waiting for all nodes could be an issue in case of downed nodes, and I concur that quorum would be a good work-around.

Broadcasting a single packet (should easily fit into a single 1500-byte Ethernet frame) to all nodes isn't _hugely_ expensive.

Multicast is usually UDP, so there are no TCP timeouts/retries to contend with. It wouldn't matter if some nodes are down - we can act as soon as we have answers from (n/2)+1 nodes, assuming, in the case of requesting a file that isn't local, that one of those peers has the file.
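
To make the (n/2)+1 rule concrete, here is a minimal Python sketch. It is not GlusterFS code: the names quorum_size and locate_file are hypothetical, and replies stands in for whatever UDP answers arrive, in arrival order, as (node, has_file) pairs. It only illustrates that the decision can be made without waiting for downed nodes to time out.

```python
def quorum_size(total_nodes):
    """Smallest strict majority of the cluster: (n/2)+1."""
    return total_nodes // 2 + 1


def locate_file(replies, total_nodes):
    """Consume (node, has_file) replies in arrival order.

    Returns the list of nodes known to hold the file as soon as a
    quorum of distinct nodes has answered, or None if the replies
    run out before quorum is reached.
    """
    needed = quorum_size(total_nodes)
    seen = set()
    holders = []
    for node, has_file in replies:
        if node in seen:
            continue  # ignore duplicated UDP replies
        seen.add(node)
        if has_file:
            holders.append(node)
        if len(seen) >= needed:
            return holders  # act now; do not wait for the rest
    return None  # quorum not reached


# Hypothetical 5-node cluster: three distinct answers are enough.
replies = [("a", False), ("b", True), ("c", False), ("d", True)]
holders = locate_file(replies, total_nodes=5)
```

If the returned list is empty, the assumption that one of the answering peers has the file did not hold, and the caller would have to keep listening or fall back to something else.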

Meaning that if you assume that at least one node will always be down, then the minimum time to locate a node with the most recent copy of the file (and thus the minimum time to begin any read) is always the timeout attached to waiting for the ping reply.

There are ways around that. Flag a node as being out of the cluster when quorum decides it is unresponsive, and fence it.

Having the entire quorum aware of which version of each file is the most recent and where to find the file avoids this problem, again, until just less than half the nodes become unreachable.

There should, in theory, be only one version of the file in the entire cluster. If there isn't, then the AFR auto-heal should be invoked to see to it that there is only one. The important thing is to know which nodes have a copy of the file.

I might optimize the expunge algorithm slightly by having nodes with low loads volunteer to copy files that otherwise couldn't be expunged from a node. Better yet, perhaps, would be a background process that runs on lightly loaded nodes and tries to create additional redundant copies at some configurable tolerance beyond the "minimum # of copies" threshold.
Not just lightly loaded nodes, but more importantly, nodes with most free space available. :)
Yes, the algorithm to detect "loading" should probably consider as many resource constraints as appears practical.

Load in terms of performance is a non-critical optimization. Space requirements being met is a mandatory requirement. :)

For file delta writes, an AFR type mechanism could be used to send the deltas to all the nodes that have the file. This could all get quite tricky, because it might require a separate multicast group to be set up for up to every node combination subset, in order to keep the network bandwidth down (or you'd just end up broadcasting to all nodes, which means things wouldn't scale as switches should, it'd be more like using hubs).
This would potentially have the problem that there are only 24 bits of IP multicast address space, but that should provide enough groups with sensible redundancy levels to cover all node combinations. This may or may not be way OTT complicated, though. There is probably a simpler and more sane solution.
I'm not sure what overhead is involved in creating multicast groups, but they would only be required for files currently locked for write, so perhaps creating and discarding the multicast groups could be done in conjunction with creation and release of write locks.
Sure, these could be dynamic, but setup and teardown might cause enough overhead that you might as well be broadcasting all the locks and writes, and just expect the affected nodes to pick those out of the air and act on them.
It's also possible that you could reduce the complexity of this problem by simply discarding as many copies down to as close to the minimum # as other nodes will allow, on write. However, I think that might reduce some of the performance benefits this design otherwise gives each node.
Also remember that the broadcasts or multicasts would only actually be useful for locks and file discovery. The actual read file transfer would be point-to-point and writes would be distributed to only the subset of nodes that are currently caching the files.
Read would be point-to-point (perhaps multi-point to point for implicit read striping across all known valid copies?), but it could still be useful to use multi-cast for write, especially if the redundant copies were behind a different switch than the node accepting the write. So multi-cast setup could happen when a server obtained a write lock, and teardown would be delayed until synchronization of redundant copies had completed.

Possibly, but if the number of possible node connections could be enumerated WRT given number of nodes and minimum required redundancy, setting them up statically and using a hash-lookup would probably be quicker, as it wouldn't require constant setups/teardowns.
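
One way such a static assignment might look, sketched in Python under assumptions of my own: nodes carry small integer ids starting at 0, every group has exactly k members, and the 24 bits of address space are taken to be the 239.0.0.0/8 block. The hypothetical group_for function computes the address arithmetically from the sorted ids (combinatorial number system ranking) rather than storing a lookup table, so it is a swapped-in technique and not anything GlusterFS provides.

```python
from ipaddress import IPv4Address
from math import comb

GROUP_BASE = int(IPv4Address("239.0.0.0"))  # assumed address block
GROUP_COUNT = 2 ** 24


def group_for(node_ids):
    """Map a set of integer node ids to a fixed multicast address.

    Sorted ids c1 < c2 < ... < ck are ranked as
    C(c1, 1) + C(c2, 2) + ... + C(ck, k), which gives every
    k-subset a distinct index starting from 0.
    """
    ids = sorted(set(node_ids))
    rank = sum(comb(node_id, i) for i, node_id in enumerate(ids, start=1))
    if rank >= GROUP_COUNT:
        raise ValueError("node set does not fit into the address block")
    return IPv4Address(GROUP_BASE + rank)


# The same set always yields the same group, whatever the order.
first = group_for([2, 0])
second = group_for([0, 2])
```

Because the address is a pure function of the node set, every node can derive it locally with no coordination. The ranking is only collision-free among sets of the same size k.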


We have 2^24 possible multicast "channels" (addresses).

Number of possible ways to pick k nodes out of n (files being k-redundant) is

n! / (k! * (n-k)!)

Whether these constraints would allow for sufficiently large clusters, I don't know.
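
The limit can be checked numerically. The Python sketch below assumes one multicast group per distinct set of exactly k nodes and 2^24 usable groups; max_cluster_size is a hypothetical helper name.

```python
from math import comb

MULTICAST_GROUPS = 2 ** 24


def max_cluster_size(k, groups=MULTICAST_GROUPS):
    """Largest n for which n! / (k! * (n-k)!) still fits into `groups`."""
    n = k  # C(k, k) == 1 always fits
    while comb(n + 1, k) <= groups:
        n += 1
    return n


limit_k2 = max_cluster_size(2)
limit_k3 = max_cluster_size(3)
```

Under those assumptions, max_cluster_size(2) gives 5793 nodes and max_cluster_size(3) gives 466; for k = 4 the limit is smaller still.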

There would need to be special handling of a case where a node accepting a big write is running out of space as a consequence and something has to be dropped. Obviously, none of the currently open files can be discarded, so there would need to be some kind of an auxiliary process that would make a node request a "volunpeer" (pun intended) to take over a file that it needs to flush out, if discarding it would bring the redundancy below the required threshold.
I think this could be worked into the normal expunge algorithm with a property like: "ANY request to expunge a file that reduces the file count below the redundancy threshold will ALWAYS generate a volunpeer IF at least one node exists with the disk space available".


Yes. Failing that, we could try the next LRU file.

It wouldn't require any special casing - the needed space will always become available upon expunge if space for the migrating file exists anywhere on the network. If all the files are expunged, or they can't be even with this property of expunge, and the local disk still fills up, then I think it would be reasonable for the FS to return a disk full error.
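
The expunge property discussed above can be sketched roughly as follows. Everything here is illustrative: the Node class, expunge and make_room are hypothetical names, sizes are abstract units, and the caller is assumed to pass only files that are not currently open, in LRU order. The volunpeer is picked as the candidate with the most free space.

```python
import errno
import os
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free: int                                   # free space, abstract units
    files: dict = field(default_factory=dict)   # filename -> size


def expunge(node, filename, cluster, min_copies):
    """Drop a local copy, recruiting a volunpeer first if needed.

    Returns False (and changes nothing) when dropping the copy would
    go below min_copies and no other node has room for it.
    """
    size = node.files[filename]
    copies = sum(filename in n.files for n in cluster)
    if copies - 1 < min_copies:
        candidates = [n for n in cluster
                      if n is not node
                      and filename not in n.files
                      and n.free >= size]
        if not candidates:
            return False
        volunpeer = max(candidates, key=lambda n: n.free)
        volunpeer.files[filename] = size
        volunpeer.free -= size
    del node.files[filename]
    node.free += size
    return True


def make_room(node, needed, lru_files, cluster, min_copies):
    """Expunge files in LRU order until `needed` units are free."""
    for filename in lru_files:
        if node.free >= needed:
            break
        if filename in node.files:
            # On failure, simply move on to the next LRU file.
            expunge(node, filename, cluster, min_copies)
    if node.free < needed:
        raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
```

A real implementation would also have to copy the data and handle the volunpeer failing part-way through; the sketch only tracks the bookkeeping.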


Agreed.

Gordan
