RE: [lwip-users] Applications to TCP/IP offload

From: Curt McDowell
Subject: RE: [lwip-users] Applications to TCP/IP offload
Date: Fri, 4 Nov 2005 10:17:58 -0800

Jim,

Your analysis is very cool, and I appreciate the insight, especially as to the effect of increased latencies depending on the application. I'm not discouraged yet. There are pretty much two simple goals in this application. One is to save as many host processor cycles as possible, and the other is to achieve line rate as measured by a ~1500-byte streaming test.

uC/OS-II looks nice, and it's priced right. Unfortunately, in this case our coprocessor has only 384kB of RAM, most of which is already allocated for large drivers and network buffers, so I've basically been prohibited from incorporating threads. I'm going to have to implement the coprocessor side as an interrupt-driven state machine.
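A minimal sketch of what a thread-free, interrupt-driven state machine could look like is given here. It assumes a single-core coprocessor where interrupt handlers only post events into a small ring buffer and a foreground loop drains it. All names (toe_event, toe_state, evq_post, toe_step, toe_poll) and the three toy states are hypothetical; this is not lwIP code and not the real TCP state machine.

```c
#include <stdint.h>

/* Hypothetical event and state sets, for illustration only. */
enum toe_event { EV_HOST_MSG, EV_RX_FRAME, EV_TIMER };
enum toe_state { ST_IDLE, ST_CONNECTING, ST_ESTABLISHED };

#define EVQ_SIZE 16u /* ring holds EVQ_SIZE - 1 events */

static volatile uint8_t evq[EVQ_SIZE];
static volatile uint8_t evq_head, evq_tail;

/* Called from an interrupt handler. Returns 0 if the queue is full. */
int evq_post(enum toe_event ev)
{
    uint8_t next = (uint8_t)((evq_head + 1u) % EVQ_SIZE);
    if (next == evq_tail)
        return 0;
    evq[evq_head] = (uint8_t)ev;
    evq_head = next;
    return 1;
}

/* Called from the foreground loop. Returns 0 if the queue is empty. */
int evq_get(enum toe_event *ev)
{
    if (evq_tail == evq_head)
        return 0;
    *ev = (enum toe_event)evq[evq_tail];
    evq_tail = (uint8_t)((evq_tail + 1u) % EVQ_SIZE);
    return 1;
}

/* One run-to-completion transition: no blocking, no per-thread stack. */
enum toe_state toe_step(enum toe_state s, enum toe_event ev)
{
    switch (s) {
    case ST_IDLE:
        return ev == EV_HOST_MSG ? ST_CONNECTING : ST_IDLE;
    case ST_CONNECTING:
        if (ev == EV_RX_FRAME) return ST_ESTABLISHED;
        if (ev == EV_TIMER)    return ST_IDLE; /* timed out */
        return s;
    case ST_ESTABLISHED:
        return s;
    }
    return s;
}

/* Drain all pending events and return the resulting state. */
enum toe_state toe_poll(enum toe_state s)
{
    enum toe_event ev;
    while (evq_get(&ev))
        s = toe_step(s, ev);
    return s;
}
```

The only RAM this needs beyond the connection state is the event ring itself, which is the attraction when there is no room for thread stacks.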

The latest plan is to run just the raw API on the coprocessor, and write new sockets interface client/server layers that can operate over a true message passing boundary. As you say, hopefully I can model critical portions of this before committing to implement the whole thing.

Regards,

Curt McDowell

Broadcom Corp.

From: Jim Gibbons [mailto:address@hidden]
Sent: Thursday, October 27, 2005 1:19 PM
To: address@hidden
Cc: 'Mailing list for lwIP users'
Subject: Re: [lwip-users] Applications to TCP/IP offload

With the coprocessor being so much slower than the host, I'm really concerned about the overall effect upon latencies, and perhaps even bandwidth. You could end up reducing TCP/IP performance by adding coprocessor functionality. I would again urge you to look at the fraction of time your host is spending in the TCP/IP stack, if at all possible. If you are bound by stack performance, that may devolve to determining the amount of time you are spending in the kernel as opposed to your app(s). If that fraction is small, then it may not be worth your while to try to reduce it. For example, if you are spending 90% of your time in your app and 10% of your time in TCP/IP, then cutting the TCP/IP time in half would only net you a small change in your performance.
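The 90%/10% example can be worked through with a short Amdahl-style estimate. The helper names below (time_saved, overall_speedup) are made up for illustration, and the model assumes the offloaded work simply disappears from the host with no added latency or bus overhead, which is the optimistic case.

```c
#include <assert.h>

/* Fraction of total host time saved when a component that takes
 * `fraction` of the time is made `speedup` times faster. */
double time_saved(double fraction, double speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0 && speedup > 0.0);
    return fraction - fraction / speedup;
}

/* Resulting whole-system speedup under the same assumptions. */
double overall_speedup(double fraction, double speedup)
{
    return 1.0 / (1.0 - time_saved(fraction, speedup));
}
```

With fraction = 0.10 and speedup = 2.0, time_saved() gives 0.05: halving a 10% TCP/IP share frees about 5% of the host's time, an overall speedup of roughly 1.05x.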

If your protocol is heavily acknowledged and you find yourself being performance bound by the performance of the protocol, any additions to latencies will end up making you slower, not faster. All that is speculation on my part, of course. You could be compute bound with a streaming TCP/IP output, in which case additions to latencies wouldn't have any effect at all.

As for the RTOS question, you can find some surprisingly small ones. We have used uC/OS-II without being horrified by its size. Depending on the CPU you are using in the coprocessor, you may find that you have some pretty good options.

I do believe that it would probably be easiest to put on a top layer as you describe, but I also think that it would be feasible to transport the messages to the tcp thread as you originally described. As you note, there are some difficulties, and it is possible that the message contents will have to be augmented to deal with some of the existing data references. In either event, you will almost certainly find yourself tinkering with the stack in one way or another. The good news is that with a small open source project like this, it is definitely feasible to do this. The bad news is that it can still be a fair amount of work.

I'm really a bit conflicted about this. On the one hand, it does sound like a really interesting thing to do technically. On the other, it may actually end up costing you in system performance. I hope you'll be able to make a good analysis of the likely outcome before you commit yourself.

Curt McDowell wrote:

Thanks for the input, Jim.

> As for the performance improvement, that's a very significant question. First, I think that it is important to ask what kind of performance improvement you seek. If you are just seeking to offload the host, so that it can go on to do some other task faster, then you stand a reasonable chance of seeing that happen. If you are ultimately seeking to increase TCP/IP throughput, that will be a more difficult road.

In our case, the host processor would be about 4 times as powerful as the coprocessor. The coprocessor has some spare cycles, and it'll be there regardless of whether it ends up doing TOE. The goal is simply to reduce CPU consumption on the host processor with no reduction in throughput. The MAC has no checksum acceleration, so that's actually one of the most important things to off-load.

> I feel that your assessment of feasibility is sound and that your list of problems and their resolution is reasonably complete. Something always shows up in implementation, and I'm sure that your project will be no exception, but I do think that your design is solid.

I'm finding that splitting the modules in the manner depicted is not so easy after all.  E.g., for efficiency reasons the top layer routine netconn_write() calls tcp_sndbuf(), which peeks in the bottom layer data structure.  It's tempting to just add a top layer to RPC the whole sockets API (but unfortunately, the tiny RTOS on the TOE processor would then need to support threads).
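One possible way around the tcp_sndbuf() peek, sketched under assumptions: the host side could keep its own mirror of the available send-buffer space and have the coprocessor report freed space in messages, so the top layer never reads the bottom-layer structure directly. The host_conn struct and both functions below are hypothetical and not part of lwIP.

```c
#include <stdint.h>

/* Hypothetical host-side mirror of one offloaded connection. */
struct host_conn {
    uint32_t snd_credit; /* bytes the coprocessor has said it can accept */
};

/* Apply a (hypothetical) "send buffer freed" message from the coprocessor. */
void host_conn_on_sndbuf_update(struct host_conn *c, uint32_t freed)
{
    c->snd_credit += freed;
}

/* Reserve up to `want` bytes for sending; returns how many may be sent now. */
uint32_t host_conn_reserve(struct host_conn *c, uint32_t want)
{
    uint32_t n = want < c->snd_credit ? want : c->snd_credit;
    c->snd_credit -= n;
    return n;
}
```

The cost of this design is that the mirrored value lags the real one by a bus round trip, so the host may send less than the stack could actually accept at that instant.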

Regards,

Curt McDowell

Broadcom Corp.

Curt McDowell wrote:

Hi,

I'm looking into using lwIP as the basis for a TOE (TCP/IP offload engine). If I understand correctly, the lwIP environment is implemented as one thread for the IP stack, and one thread for each application:

    APPLICATION THREAD                            IP STACK THREAD
App <-> Sockets <-> API-mux <------------> API-demux <-> Stack <-> netif
                            mbox transport

This architecture appears to lend itself fairly well to the following TOE implementation (actually, SOE, as it would be a full sockets offload):

         HOST PROCESSOR                     TOE ADAPTER W/ EMBEDDED CPU
+-------------+   +--------------+            +-------+   +----------+
| App using   |---| lwIP library |------------| lwIP  |---| Network  |--->
| sockets API |   | Sockets API  |  Hardware  | stack |   | hardware |
+-------------+   +--------------+    bus     +-------+   +----------+

- Does this assessment sound correct?
- Could a significant performance improvement be realized, compared to using a host-native IP stack?
- Is anyone else interested in this type of application?

The only problems that I see are with the mbox transport mechanism, in that it assumes a shared address space.

- It would need to send the data, instead of pointers to the data.
- It would need to send messages for event notifications instead of using callbacks.
- Message reception on either side of the hardware bus would be signaled through interrupts.
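The first point, sending the data rather than a pointer to it, can be sketched as a flat message that is copied across the bus. The header layout, the MSG_WRITE type, the conn_id handle and the function names are all assumptions for illustration, and the sketch further assumes both processors use the same byte order and struct layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MSG_WRITE = 1 };        /* hypothetical message type */
#define MSG_MAX_PAYLOAD 1460u  /* assumed per-message payload limit */

/* Fixed-size header; conn_id is a handle, not a pointer, so it is
 * meaningful on both sides of the bus. */
struct toe_msg_hdr {
    uint16_t type;
    uint16_t conn_id;
    uint32_t len;
};

/* Copy header and payload into buf. Returns bytes written, or 0 on error. */
size_t toe_pack_write(uint8_t *buf, size_t cap, uint16_t conn_id,
                      const void *data, uint32_t len)
{
    struct toe_msg_hdr hdr = { MSG_WRITE, conn_id, len };
    if (len > MSG_MAX_PAYLOAD || cap < sizeof hdr + len)
        return 0;
    memcpy(buf, &hdr, sizeof hdr);
    memcpy(buf + sizeof hdr, data, len);
    return sizeof hdr + len;
}

/* Parse a received message. Returns bytes consumed, or 0 if truncated. */
size_t toe_unpack(const uint8_t *buf, size_t n, struct toe_msg_hdr *hdr,
                  const uint8_t **payload)
{
    if (n < sizeof *hdr)
        return 0;
    memcpy(hdr, buf, sizeof *hdr);
    if (n - sizeof *hdr < hdr->len)
        return 0;
    *payload = buf + sizeof *hdr;
    return sizeof *hdr + hdr->len;
}
```

Event notifications in the other direction could reuse the same header with a different type value in place of a callback pointer.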

Thanks,
Curt McDowell
Broadcom Corp.

Jim Gibbons	address@hidden
Gibbons and Associates, Inc.	TEL: (408) 984-1441
900 Lafayette, Suite 704, Santa Clara, CA	FAX: (408) 247-6395
