
Re: [Gluster-devel] Replicate/AFR Using Broadcast/Multicast?


From: Gordan Bobic
Subject: Re: [Gluster-devel] Replicate/AFR Using Broadcast/Multicast?
Date: Wed, 13 Oct 2010 09:06:56 +0100
User-agent: Mozilla-Thunderbird 2.0.0.24 (X11/20100328)

Beat Rubischon wrote:
Hello!

Quoting <address@hidden> (13.10.10 09:11):

I would also expect to see network issues as a cluster grows. Isn't performance
degrading as the node count increases seen as a bigger issue?

I have pretty bad experience with multicast. Running several clusters in the
500-1000 node range in a single broadcast domain over the last year has shown
that broadcast or multicast can easily kill your fabric.

What sort of a cluster are you running with that many nodes? RHCS? Heartbeat? Something else entirely? In what arrangement?

Even the most expensive GigE switch chassis could be killed by 125+ MBytes/s
of traffic, which is almost nothing :-)

Sounds like a typical example of cost not being a good measure of quality and performance. :)

The inbound traffic must be routed to several outbound ports, which results in
congestion. Throttling and packet loss are the result, even for the normal
unicast traffic. Multicast or broadcast is a nice way to cause a denial of
service in a LAN.

If you have that many GlusterFS nodes, you have DoS-ed the storage network anyway purely by the write amplification of the inverse scaling. The idea is that you have a (mostly) dedicated VLAN for this.

In InfiniBand, multicast is typically realized by looping through the
destinations directly on the source HBA. It is one of the main targets of
current development, as multicast has a network pattern similar to the MPI
collectives. So there is no win in using multicast over such a fabric.

Sure, but historically in the networking space, non-ethernet technologies have always been niche, cost-ineffective in terms of price/performance, and have only had a temporary performance advantage.

At the moment the only way to achieve linear-ish scaling is to connect the nodes directly to each other, but that means that each node in an n-way cluster has to have at least n NICs (one to each other node plus at least one to the clients). That becomes impractical very quickly.
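To put rough numbers on that (a minimal back-of-the-envelope sketch; the function and node counts below are just illustrative, not anything taken from GlusterFS itself):

def full_mesh_cost(nodes):
    # One NIC per peer on the replication mesh, plus at least one
    # client-facing NIC per node.
    nics_per_node = (nodes - 1) + 1
    # A full mesh needs n*(n-1)/2 point-to-point replication links.
    replication_links = nodes * (nodes - 1) // 2
    return nics_per_node, replication_links

for n in (3, 5, 10, 20):
    nics, links = full_mesh_cost(n)
    print(f"{n:2d} nodes: {nics:2d} NICs per node, {links:3d} replication links")

Already at 10 nodes that is 10 NICs per storage box and 45 dedicated links, which is where the "impractical very quickly" bit comes from.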

Using parallel storage means you have communication from a large number of
nodes to a large number of servers. Using unicast looks bad at first glance,
but I'm confident it's the better solution overall.

The problem is that the write bandwidth is fundamentally limited to the speed of your interconnect, and this is shared between all the nodes. So if you are on a 1Gb ethernet, your total volume of writes between all the replicas cannot exceed 1Gb/s. If you have 10 nodes, that limits your write throughput to 100Mb/s (a rough sketch of this arithmetic follows the list below). The only way to scale is either by:

1) Putting ever more NICs into each storage chassis and turning the replication network into what is essentially a point-to-point network. Complex, inefficient and adding to the cost - 1Gb NICs are cheap, but 10Gb ones are still expensive. There is also a limit on how many NICs you can cram into a COTS storage node.

2) Upgrading to an ever faster interconnect (10Gb ethernet, Infiniband, etc.). Expensive and still doesn't scale as you add more storage nodes.
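To illustrate the shared-interconnect ceiling (a minimal sketch; the link rates and replica counts are illustrative assumptions, not measurements):

def usable_write_throughput(link_mbps, replicas):
    # Every client write has to reach each replica over the same shared
    # link, so usable write throughput is roughly link rate / replica count.
    return link_mbps / replicas

for link_mbps, label in ((1000, "1GbE"), (10000, "10GbE")):
    for n in (2, 5, 10):
        mbps = usable_write_throughput(link_mbps, n)
        print(f"{label}, {n:2d} replicas: ~{mbps:.0f} Mb/s of writes")

The 1GbE/10-node row is the 100Mb/s figure mentioned above; note that a 10GbE upgrade only raises the ceiling, it doesn't change the 1/n scaling.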

Right now more storage nodes means slower storage, and that should really be addressed.

Gordan


