RE: [lwip-users] How to optimize raw UDP performance

There's some risk with disabling the UDP checksum, but it's low. From what I see, UDP loss happens on the scale of whole packets, not bytes within packets. On Windows, it can be bad. I've seen 30-40 contiguous dropped packets just from minimizing a window (even one belonging to an application other than the UDP-based one). OTOH, I can run 150,000 packets a second for an hour without a drop (on a LAN, as a matter of fact). Be prepared to contend with the non-lwIP end of the connection at higher speeds.
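For reference, the two options named in the quoted message below are ordinary lwIP build options. A minimal sketch of the lwipopts.h fragment, assuming a port that reads its options from that file in the usual way:

```c
/* lwipopts.h fragment (sketch): turn off software UDP checksums. */
#define CHECKSUM_GEN_UDP    0   /* do not generate checksums on outgoing UDP */
#define CHECKSUM_CHECK_UDP  0   /* do not verify checksums on incoming UDP */
```

With CHECKSUM_GEN_UDP at 0 the UDP checksum field goes out as zero, which IPv4 permits; the Ethernet frame check sequence still covers the frame on each link.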

Optimize your Ethernet driver. If you can't send UDP packets at 980+ Mb/s, your driver is not optimal. Although you don't need that speed, the faster each packet is sent, the better. Note: for this test, use a static copy of one UDP pbuf and send it repeatedly.
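To put such numbers in context, here is a rough sketch of the theoretical ceiling on a given link, using the standard Ethernet per-frame overheads (preamble, header, FCS, inter-frame gap). The function names are made up for illustration, and which of these byte counts the figures in this thread include is not stated.

```c
/* Standard per-frame byte counts on Ethernet carrying IPv4/UDP. */
enum {
    ETH_PREAMBLE_SFD = 8,   /* preamble + start-of-frame delimiter */
    ETH_HEADER       = 14,  /* dst MAC, src MAC, EtherType */
    ETH_FCS          = 4,   /* frame check sequence */
    ETH_IFG          = 12,  /* minimum inter-frame gap */
    ETH_MIN_PAYLOAD  = 46,  /* short frames are padded up to this */
    IPV4_HEADER      = 20,  /* no IP options */
    UDP_HEADER       = 8
};

/* Bytes of wire time consumed by one frame carrying udp_payload bytes. */
static unsigned wire_bytes(unsigned udp_payload)
{
    unsigned eth_payload = IPV4_HEADER + UDP_HEADER + udp_payload;
    if (eth_payload < ETH_MIN_PAYLOAD)
        eth_payload = ETH_MIN_PAYLOAD;
    return ETH_PREAMBLE_SFD + ETH_HEADER + eth_payload + ETH_FCS + ETH_IFG;
}

/* Upper bound on packets per second for a link of link_bps bits/s. */
static double max_packets_per_sec(double link_bps, unsigned udp_payload)
{
    return link_bps / (8.0 * wire_bytes(udp_payload));
}

/* Upper bound on UDP payload throughput in Mb/s. */
static double max_payload_mbps(double link_bps, unsigned udp_payload)
{
    return max_packets_per_sec(link_bps, udp_payload) * 8.0 * udp_payload / 1e6;
}
```

For a full 1472-byte UDP payload on gigabit Ethernet this gives roughly 81,000 packets per second and about 957 Mb/s of payload; counting the whole 1518-byte frame instead gives about 987 Mb/s.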

This will help you:

lwIP Output Call Tree (per packet)
udp_sendto	Looks up which netif has IP addr - calls:
udp_sendto_if	Adds UDP header, fills in, chksums it - calls:
ip_output_if	Adds IP header, fills in, chksums it - calls:
etharp_output	Makes room for Eth header (no checksum at this layer) - calls:
etharp_query	Looks up MAC from dest IP (ARP) - calls:
etharp_send_ip	Fills in 2 MAC addrs, calls:
netif->linkoutput	Raw packet send

The later in this chain you make your call, the better. There is a HUGE difference between udp_sendto and etharp_query! The speed killers are the ARP lookup, the redundant address checks, a few pbuf_header calls, and small copies in these routines. I optimized etharp.c with a faster cache test, moved these functions to on-chip memory (this is big if you can do it), and replaced the SMEMCPYs and for-loop MAC copies with more efficient copies. Calling etharp_query then got over 700 Mb/s (100 MHz Cyclone III FPGA running Nios II; this may be close to your platform). Compare this to udp_sendto_if, which was only about 325 Mb/s.

In the end I resorted to my own routines to build UDP packets (one pbuf with the IP/UDP header chained to the payload). With checksums disabled I get 969 Mb/s. (I had a goal to get close to wire speed if possible.) I had to time this on the target side: Windows can only keep up with short bursts at this speed (250 packets or less), and Wireshark has some difficulties but will also capture short bursts. I timed runs of 100 packets in Wireshark to validate the times recorded on the target. These times were taken with nothing else going on in the system.
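To illustrate the idea of building the headers once and reusing them, here is a minimal sketch in plain C. It is not the routine described above and does not use lwIP: build_udp_header, put_be16 and ip_checksum are hypothetical helpers, and the fixed 42-byte layout assumes IPv4 without options, a constant payload length, and the UDP checksum left at zero (disabled).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ETH_HLEN = 14, IP_HLEN = 20, UDP_HLEN = 8, HDR_LEN = 42 };

/* Store a 16-bit value in network (big-endian) byte order. */
static void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/* Internet checksum (RFC 1071) over an even-length header. */
static uint16_t ip_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Fill a 42-byte Ethernet/IPv4/UDP header template once, up front. */
static void build_udp_header(uint8_t hdr[HDR_LEN],
                             const uint8_t dst_mac[6], const uint8_t src_mac[6],
                             const uint8_t src_ip[4], const uint8_t dst_ip[4],
                             uint16_t src_port, uint16_t dst_port,
                             uint16_t payload_len)
{
    uint8_t *ip  = hdr + ETH_HLEN;
    uint8_t *udp = ip + IP_HLEN;

    memset(hdr, 0, HDR_LEN);

    /* Ethernet: destination MAC resolved ahead of time, no per-packet ARP. */
    memcpy(hdr, dst_mac, 6);
    memcpy(hdr + 6, src_mac, 6);
    put_be16(hdr + 12, 0x0800);                 /* EtherType: IPv4 */

    /* IPv4: the header checksum is computed once for the template. */
    ip[0] = 0x45;                               /* version 4, IHL 5 */
    put_be16(ip + 2, (uint16_t)(IP_HLEN + UDP_HLEN + payload_len));
    ip[8] = 64;                                 /* TTL */
    ip[9] = 17;                                 /* protocol: UDP */
    memcpy(ip + 12, src_ip, 4);
    memcpy(ip + 16, dst_ip, 4);
    put_be16(ip + 10, ip_checksum(ip, IP_HLEN));

    /* UDP: checksum bytes stay zero, meaning "no checksum" for IPv4. */
    put_be16(udp + 0, src_port);
    put_be16(udp + 2, dst_port);
    put_be16(udp + 4, (uint16_t)(UDP_HLEN + payload_len));
}
```

The template is filled once at startup; each send then only has to attach the payload and hand the frame to the driver. If the payload length or any IP field changes per packet, the affected fields and the IP header checksum have to be patched again.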

My speeds reflect changes in several areas and do not reflect what is possible *only* changing lwIP or optimizing the driver. My goal was to make Ethernet communications as fast as possible without rules of what to change and not to change.

Bill

>-----Original Message-----
>From: address@hidden
>[mailto:address@hidden On Behalf Of Max Bobrov
>Sent: Thursday, September 24, 2009 1:17 PM
>To: Mailing list for lwIP users
>Subject: Re: [lwip-users] How to optimize raw UDP performance

>Bill: Thank you! Disabling CHECKSUM_CHECK_UDP and CHECKSUM_GEN_UDP gave
>a considerable increase in performance. The Xilinx GUI for lwIP
>could use some significant improvement to make this and many other
>features more accessible.
>
>Chris: I've increased some of these values (listed below) but haven't
>seen much improvement from that. Do these look OK, or have you had
>better success with others?

>#define MEM_ALIGNMENT 8
>#define MEM_SIZE 262144
>#define MEMP_NUM_PBUF 32
>#define MEMP_NUM_UDP_PCB 8
>#define MEMP_NUM_TCP_PCB 32
>#define MEMP_NUM_TCP_PCB_LISTEN 8
>#define MEMP_NUM_TCP_SEG 256
>#define LWIP_USE_HEAP_FROM_INTERRUPT 1
>#define MEMP_NUM_SYS_TIMEOUT 8
>#define PBUF_POOL_SIZE 256
>#define PBUF_POOL_BUFSIZE 2048
>#define PBUF_LINK_HLEN 16

>On Wed, Sep 23, 2009 at 11:06 PM, Chris Strahm <address@hidden> wrote:
>> Actually someone else reported to me that turning the checksum off in lwIP
>> actually made it slower. I have not checked the reason for this, but that
>> was someone else's experience. There is a big difference in whether you use
>> 8/16/32 bit memcpy type routines. Also if you can write it in asm. Since
>> yours is FPGA, little different. Also same kind of thing for checksum. Asm
>> will be faster. Sometimes the difference in how a particular variable or
>> address pointer is generated by C can result in very big difference in code.
>> You have to look at everything when it comes to high performance.
>>
>> Also what is the size of your PBUFs and your blocks in your DMA or MAC ISR.
>> I assume for a 1G Enet system you probably want the maximum, about 1536
>> each.
>>
>> Chris.
>>
>> _______________________________________________
>> lwip-users mailing list
>> address@hidden
>> http://lists.nongnu.org/mailman/listinfo/lwip-users


From:	Bill Auerbach
Subject:	RE: [lwip-users] How to optimize raw UDP performance
Date:	Thu, 24 Sep 2009 14:54:16 -0400