On 5/5/2010 8:53 AM, address@hidden wrote:
Tyrel Newton wrote:
As to the aligned pbuf payload: I think the code currently relies on mem_malloc returning aligned data (and that should be OK with your current settings), so you might want to check the return values of your libc malloc.
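A quick way to spot-check that (just a standalone sketch, nothing from
lwIP; the 1518-byte size and the alignment of 4 are assumptions taken
from this thread):

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

int main(void)
{
  /* spot-check that libc malloc returns pointers that satisfy
   * MEM_ALIGNMENT=4 */
  void *p = malloc(1518);           /* about one max-size Ethernet frame */
  assert(((uintptr_t)p % 4) == 0);  /* fires if malloc is not 4-byte aligned */
  free(p);
  return 0;
}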
As the pbuf code is written (I think I'm running the latest stable,
1.3.2), there is no way to guarantee a 32-bit aligned payload pointer at
the start of the Ethernet frame with MEM_ALIGNMENT=4. This is because in
pbuf_alloc, the payload pointer for PBUF_RAM types is initialized at an
offset that is itself memory-aligned (this offset equals the size of
the pbuf structure plus the various header lengths). When the 14-byte
Ethernet header is eventually uncovered, it will always be 16-bit
aligned, since the original payload pointer was 32-bit aligned. This
of course assumes PBUF_LINK_HLEN=14.
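To make the arithmetic concrete, here is a small standalone sketch of
the offset calculation (not the real pbuf code; SIZEOF_STRUCT_PBUF=16
is just an assumed value, it depends on the port and compiler):

#include <stdio.h>

#define PBUF_TRANSPORT_HLEN 20  /* TCP header */
#define PBUF_IP_HLEN        20  /* IP header */
#define PBUF_LINK_HLEN      14  /* Ethernet header */
#define SIZEOF_STRUCT_PBUF  16  /* assumed; depends on port and compiler */

/* same rounding lwIP applies for MEM_ALIGNMENT */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
  /* pretend mem_malloc() returned a 4-byte aligned block at address 0 */
  unsigned base   = 0;
  unsigned offset = PBUF_TRANSPORT_HLEN + PBUF_IP_HLEN + PBUF_LINK_HLEN; /* 54 */
  unsigned align;

  for (align = 2; align <= 4; align += 2) {
    unsigned payload = ALIGN_UP(base + SIZEOF_STRUCT_PBUF + offset, align);
    unsigned frame   = payload - offset; /* frame start once headers are uncovered */
    printf("MEM_ALIGNMENT=%u: payload %% 4 = %u, frame start %% 4 = %u\n",
           align, payload % 4, frame % 4);
  }
  /* prints:
   *   MEM_ALIGNMENT=2: payload % 4 = 2, frame start % 4 = 0
   *   MEM_ALIGNMENT=4: payload % 4 = 0, frame start % 4 = 2 */
  return 0;
}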
I see... I must say I didn't check that yet. And as my code itself
requires the payload to be aligned (or I would have to use packed
structs to access the contents), I just ended up with 16-bit DMA
transfers (using an Altera NIOS-II system with a standard Altera
RAM-to-RAM DMA engine). I always planned to write my own DMA engine in
VHDL that can do 32-bit transfers from 16-bit aligned data, but I never
got around to it.
Anyway, if there is a requirement to let pbuf_alloc produce an
unaligned payload so that the outer header is aligned, please file a
bug report or patch at Savannah!
I thought of that, but it depends on "what" you want aligned when the
pbuf is created--the Ethernet frame itself or the actual payload within
the TCP frame. Suppose I filled the TCP frame with lots of 32-bit
aligned data from within mainline software but then used a 16-bit
aligned DMA to move the frame to the mac (or a zero-copy mac that can
access individual bytes from memory).
The moral for me is that I actually see higher throughput by setting
MEM_ALIGNMENT=2, which guarantees that when the Ethernet header is
uncovered, it will be 32-bit aligned. Even though the TCP/IP headers are
unaligned, the final copy to the mac's transmit buffer is much faster if
the source pointer is 32-bit aligned, i.e. at the start of the actual
Ethernet frame.
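For illustration, the kind of copy loop I mean (hypothetical names; the
word-wide-only transmit buffer matches the ethernetlite core, but the
function itself is invented for this sketch):

#include <stddef.h>
#include <stdint.h>

/* Copy a frame into a 32-bit-only MAC transmit buffer one word at a
 * time. Requires `frame` to be 32-bit aligned -- which is exactly what
 * MEM_ALIGNMENT=2 buys above. May read up to 3 bytes past the end of
 * the frame, so the source buffer needs a little slack. */
static void mac_copy_frame(volatile uint32_t *txbuf, const void *frame,
                           size_t len)
{
  const uint32_t *src = (const uint32_t *)frame;
  size_t i, words = (len + 3) / 4;

  for (i = 0; i < words; i++) {
    txbuf[i] = src[i]; /* one bus access per 4 bytes instead of per 2 */
  }
}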
The question is whether the final copy is what matters, or the rest of
the processing: when the final copy is done in the background by a DMA
engine, this might not even be harmful. While it is true that the
transfer takes longer, it only has to finish before the previous frame
is done sending. The only difference then is how long the DMA transfer
generates a background load on the RAM bus, and whether it uses too
much RAM bandwidth for the processor to work normally.
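For a rough sense of scale (illustrative numbers only, not measurements
from either of our boards): a full-size 1518-byte frame occupies the
wire for 1518 * 8 / 100e6 = about 121 microseconds at 100 Mbit/s, so a
background DMA moving the next frame has that whole window to finish.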
However, if the processor does the final copy (without a DMA engine),
then it's a bad thing if the data is not aligned. But you should be
able to include a DMA engine in your FPGA, so...
Xilinx provides a gigabit mac with a built-in DMA (at an additional
cost of course), so I definitely have options. I could also
write my own DMA, or for that matter, my own non-DMA Ethernet mac that
simply accepts and discards a two-byte pad. But all of that is outside
the scope (and priority) of my current effort. At the moment, I'm not
terribly concerned about Ethernet performance as long as it works and
isn't horrendously slow. My investigations into this issue came from
rewriting the horrible lwIP driver provided by Xilinx. By rewriting
the code in a reasonably intelligent manner, I managed to increase the
throughput 4x and make the system more stable. C code is easier to
change than VHDL...
Btw, this is also assuming the outgoing data is copied into the stack
such that all the outgoing pbufs are PBUF_RAM-type.
Single PBUF_RAM pbufs or chained pbufs?
Single PBUF_RAM pbufs. Looking through the TCP code, if the data is
being copied into the stack (i.e. via NETCONN_COPY), I'm not even sure
how chained pbufs would be created (assuming malloc returns a block big
enough for an Ethernet frame).
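For reference, the case I mean is the plain copying API (netconn_write
and NETCONN_COPY are the real lwIP API; the trivial wrapper is just for
illustration):

#include "lwip/api.h"

/* With NETCONN_COPY, lwIP copies `data` into stack-owned PBUF_RAM pbufs
 * before queueing it on the TCP send buffer, so the caller may reuse
 * `data` as soon as this returns. */
static err_t send_copied(struct netconn *conn, const void *data, size_t len)
{
  return netconn_write(conn, data, len, NETCONN_COPY);
}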
Interesting results, but pretty esoteric since this is not an oft-used
platform (MicroBlaze w/ xps_ethernetlite IP core).
Not that different to my own platform ;-) And after all, we need
examples for task #7896 (Support zero-copy drivers) and this
is one example to start with.
I wouldn't say the system I'm using (at the moment at least) is
zero-copy because once I receive the frame from lwIP, I pbuf_ref it,
queue it up for transmit, and then eventually copy its payload to the
mac's transmit buffer, after which I do a pbuf_free. Although I guess
this is still zero-copy from the stack's frame of reference... it's
probably worth distinguishing somewhere between zero-copy macs and
zero-copy drivers.
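In code, the pattern I'm describing looks roughly like this (a sketch
only; the tx_queue_* and mac_* helpers stand in for my driver's
internals and are not real APIs):

#include "lwip/err.h"
#include "lwip/netif.h"
#include "lwip/pbuf.h"

/* placeholder driver primitives */
void tx_queue_push(struct pbuf *p);
struct pbuf *tx_queue_pop(void);
void mac_copy_to_txbuf(const void *data, u16_t len);
void mac_start_transmit(u16_t frame_len);

/* called by the stack to send a frame */
static err_t low_level_output(struct netif *netif, struct pbuf *p)
{
  (void)netif;
  pbuf_ref(p);      /* keep the pbuf alive after the stack returns */
  tx_queue_push(p); /* defer the copy until the MAC buffer is free */
  return ERR_OK;
}

/* called when the MAC's transmit buffer becomes available again */
static void tx_service(void)
{
  struct pbuf *p = tx_queue_pop();
  struct pbuf *q;

  if (p == NULL) {
    return;
  }
  for (q = p; q != NULL; q = q->next) {
    mac_copy_to_txbuf(q->payload, q->len); /* runs once for a single pbuf */
  }
  mac_start_transmit(p->tot_len);
  pbuf_free(p); /* drops the reference taken in low_level_output() */
}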
Tyrel
Simon