
[lwip-users] lwip lock


From: Tazzari Davide
Subject: [lwip-users] lwip lock
Date: Tue, 22 Mar 2011 16:41:41 +0100

Hi all,

I have created an AVR32 application based on FreeRTOS and lwIP 1.3.2.

My application is quite large, but I want to concentrate on my problems.

There are two tasks that use HTTP connections: a web server and a web client that talks to an external portal.

The application simply collects some data and, periodically, POSTs them to an Apache-based web portal.

The web server is of course active only when a browser connects; otherwise it sits blocked in a listen state.

Here is my problem.

Sometimes, somehow, all the TCP connections lock up and are lost: the web server is no longer accessible and the application can no longer communicate with the portal.

This seems to happen when I try to access the web server while, at the same time, the device is trying to reach the portal.

 

I started to analyze lwIP, and here is what I found.

In mem.c I added the following code:

static u8_t *ram;
/** the last entry, always unused! */
static struct mem *ram_end;
/** pointer to the lowest free block, this is used for faster search */
static struct mem *lfree;

u8_t **ppMemRam;            // DT 2011/03/09
/** the last entry, always unused! */
struct mem **ppMemRamEnd;   // DT 2011/03/09
/** pointer to the lowest free block, this is used for faster search */
struct mem **ppMemLFree;    // DT 2011/03/09

...

void
mem_init(void)
{
...
  ppMemRam    = &ram;       // DT 2011/03/09
  ppMemRamEnd = &ram_end;   // DT 2011/03/09
  ppMemLFree  = &lfree;     // DT 2011/03/09
}

This lets me inspect (through a serial debugger) the state of the heap area holding the lwIP data.

When the problem happens, the lfree pointer is stuck at an address different from ram.

I looked at the mem ram area and found that the chain of the various allocations was OK.

It seems something was not freed, for some reason unknown to me.

Sometimes this is not critical because access still works, but the wasted area grows little by little, eventually saturating the heap and locking up communication.
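
For reference, this is the kind of walker that can be run over those exported pointers to spot such areas (a minimal sketch; mem_walk and the printf output are my own additions, and it assumes the private struct mem layout of lwIP 1.3.2's mem.c):

/* Walk the lwIP heap from ram to ram_end and report each block.
 * Must live in mem.c, where the private struct mem
 * (mem_size_t next; mem_size_t prev; u8_t used;) is visible;
 * next/prev are byte offsets into the ram array. */
void mem_walk(void)
{
  struct mem *m = (struct mem *)*ppMemRam;
  while (m < *ppMemRamEnd) {
    printf("block @%p used=%u next=%u prev=%u\n",
           (void *)m, (unsigned)m->used, (unsigned)m->next, (unsigned)m->prev);
    if (m->next == 0) {
      break;               /* corrupted chain: avoid looping forever */
    }
    m = (struct mem *)(*ppMemRam + m->next);
  }
}

A block with used == 0 that lfree never reaches is a candidate for a lost area.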

 

I suppose this is not a cause but an effect, so I continued my analysis.

I concentrated on the memp area.

I studied it, without claiming to understand everything, but anyway here is what I discovered.

I show only the TCP_SEG pool, which seems relevant to me.

 

HEX    Offset  Delta  Block  Arg      RefCh  RefMem  Free
1E08   2564    20     0      TCP_SEG  0
1E1C   2584    20     1      TCP_SEG  1E08
1E30   2604    20     2      TCP_SEG  1E1C
1E44   2624    20     3      TCP_SEG  1E30
1E58   2644    20     4      TCP_SEG  1E44
1E6C   2664    20     5      TCP_SEG  1E58
1E80   2684    20     6      TCP_SEG  1E6C
1E94   2704    20     7      TCP_SEG  ?      1EE4
1EA8   2724    20     8      TCP_SEG  ?      0
1EBC   2744    20     9      TCP_SEG  1E80   -       xxx
1ED0   2764    20     10     TCP_SEG  ?      0
1EE4   2784    20     11     TCP_SEG  ?      0

 

Let me describe the columns:

HEX is the absolute address in memory of the memp block.
Offset is the absolute offset in bytes from the top of the whole memp structure.
Delta is the size of the single block.
Block is the index of the block.
RefCh is the address of the "next" block in the chain.
RefMem is the address of the "next" block found by walking the memory.
Free marks the first free block.

 

What it seems is that block 9 is the first free one. The next is block 6, then 5, 4, 3, 2, 1, 0.

Reading the memory I saw that block 7 is chained to block 11. These two blocks are chained together but no longer reachable from the free list.

Likewise, blocks 10 and 8 seem to be unreachable and chained to nothing.
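
This can also be cross-checked in code (a rough sketch, assuming the lwIP 1.3.2 memp internals: struct memp with a single next pointer, and the static memp_tab[] array of free-list heads in memp.c; memp_count_free is my own helper):

/* Count the blocks of a pool still reachable from its free list.
 * Place this in memp.c, where struct memp and memp_tab[] are visible. */
u16_t memp_count_free(memp_t type)
{
  u16_t n = 0;
  struct memp *m = memp_tab[type];   /* head of the free list */
  while (m != NULL) {
    n++;
    m = m->next;
  }
  return n;
}

If memp_count_free(MEMP_TCP_SEG) plus the segments really in use is less than the pool size (12 here), the difference is exactly the lost blocks.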

 

What I see is that these two phenomena are related: when I lose mem area I lose TCP_SEG blocks as well.

If we take a look at the tcp_seg structure:

 

struct tcp_seg {
  struct tcp_seg *next;    /* used when putting segments on a queue */
  struct pbuf *p;          /* buffer containing data + TCP header */
...

 

 

we can see that there is a reference to a pbuf. The lost tcp_seg blocks do refer to that lost mem area!
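
That coupling is visible in the way lwIP frees a segment; roughly (paraphrasing tcp_seg_free() from tcp.c in 1.3.2), the pbuf is released to the mem heap and the segment goes back to the memp pool in the same call:

u8_t tcp_seg_free(struct tcp_seg *seg)
{
  u8_t count = 0;
  if (seg != NULL) {
    if (seg->p != NULL) {
      count = pbuf_free(seg->p);    /* releases the data/header pbuf */
    }
    memp_free(MEMP_TCP_SEG, seg);   /* returns the segment to the pool */
  }
  return count;
}

So if a tcp_seg leaks, the pbuf it points to (and therefore its mem heap area) leaks with it.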

 

Has anyone ever seen such a problem?

Any suggestion on how to solve it?

I also read the lwIP memp stats:

 

lwip_stats.memp[i].max
lwip_stats.memp[i].avail
lwip_stats.memp[i].used

 

and what I found for TCP_SEG was even 12, 12, 12, so all memp blocks used!
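
For reference, this is the kind of dump I read over the serial line (a minimal sketch; print_tcpseg_stats is my own helper, and it needs LWIP_STATS and MEMP_STATS enabled in lwipopts.h):

#include <stdio.h>
#include "lwip/memp.h"
#include "lwip/stats.h"

/* Dump the TCP_SEG pool counters. */
void print_tcpseg_stats(void)
{
  printf("TCP_SEG: avail=%u used=%u max=%u\n",
         (unsigned)lwip_stats.memp[MEMP_TCP_SEG].avail,
         (unsigned)lwip_stats.memp[MEMP_TCP_SEG].used,
         (unsigned)lwip_stats.memp[MEMP_TCP_SEG].max);
}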

 

I have one idea, but I don't know whether it might create worse problems. It is not a solution, because I don't know the real cause, but it is a sort of sanity recovery of the TCP_SEG blocks.

Looking at the example posted above, I could chain the two lost blocks (10 and 8) to the top of the list and the chained blocks (7 and 11) to the bottom of the list. This way I can recover at least the lost blocks. The chained blocks (7 and 11) can, in theory, still be used and freed; or at least, I don't know whether they are really in use or lost.

So, the result should be:

7 (chained) -> 11 (lost) -> 9 (free) -> 6 -> 5 -> 4 -> 3 -> 2 -> 1 -> 0 -> 8 (lost) -> 10 (lost)

This, of course, must be done by hand.

For blocks 8 and 10 I suppose I also have to call the mem_free function on the block->p area.
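
Something like this is what I have in mind (a very rough, untested sketch; lost[] is a hypothetical array filled by a scan like the one above, MEMP_SIZE is the aligned memp header size from memp.c, and I use pbuf_free() on seg->p rather than raw mem_free since the pbuf may have a reference count):

/* Push blocks identified as lost back onto the free list.
 * Must run while the stack is otherwise idle; place it in memp.c. */
void memp_reclaim_lost(memp_t type, struct memp **lost, int n)
{
  int i;
  for (i = 0; i < n; i++) {
    if (type == MEMP_TCP_SEG) {
      /* the payload follows the memp header: free its pbuf first */
      struct tcp_seg *seg = (struct tcp_seg *)((u8_t *)lost[i] + MEMP_SIZE);
      if (seg->p != NULL) {
        pbuf_free(seg->p);
        seg->p = NULL;
      }
    }
    lost[i]->next = memp_tab[type];   /* chain to the current head */
    memp_tab[type] = lost[i];
  }
}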

 

Is it a good idea?

Again, does anybody know this problem, or what the hell I have done to create it?

Another problem. I don't know if it is related; maybe it is the same problem with different effects.

The tcpip_thread stalls!

 

static void
tcpip_thread(void *arg)
{
...
  while (1) {                          /* MAIN loop */
    gusTcpThread++;      // DT 03/03/2011 Debug
    gucStatusTCPIP = 0;  // DT 2011/03/04 TEST
    sys_mbox_fetch(mbox, (void *)&msg);
    gucStatusTCPIP = 1;  // DT 2011/03/04 TEST
    switch (msg->type) {
#if LWIP_NETCONN
    case TCPIP_MSG_API:
      LWIP_DEBUGF(TCPIP_DEBUG, ("tcpip_thread: API message %p\n", (void *)msg));
      gucStatusTCPIP = 2;  // DT 2011/03/04 TEST
      msg->msg.apimsg->function(&(msg->msg.apimsg->msg));
      gucStatusTCPIP = 3;  // DT 2011/03/04 TEST
      break;
#endif /* LWIP_NETCONN */
...
}

 

What I see is that the gusTcpThread counter stops. In that case the debug variable gucStatusTCPIP is 2, so the thread stalls inside the call to the API function. I don't know which function it is or which mbox it relates to.
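
To catch the stall automatically I could add a small watchdog task around those debug variables (a sketch only; the task, its period, and the variable types are my assumptions):

#include <stdio.h>
#include "FreeRTOS.h"
#include "task.h"

extern volatile unsigned short gusTcpThread;   /* incremented once per loop */
extern volatile unsigned char  gucStatusTCPIP; /* last position in the loop */

static void watchdog_task(void *arg)
{
  unsigned short last = gusTcpThread;
  (void)arg;
  for (;;) {
    vTaskDelay(5000 / portTICK_RATE_MS);   /* check every ~5 s */
    if (gusTcpThread == last) {
      /* no progress since the last check: report where it stalled */
      printf("tcpip_thread stalled, gucStatusTCPIP=%u\n",
             (unsigned)gucStatusTCPIP);
    }
    last = gusTcpThread;
  }
}

On the posting side, here is my sys_mbox_post port: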

 

// Posts the "msg" to the mailbox. This function has to block until the "msg"
// is really posted.
void sys_mbox_post(sys_mbox_t mbox, void *msg)
{
  // NOTE: we assume mbox != SYS_MBOX_NULL; iow, we assume the calling function
  // takes care of checking the mbox validity before calling this function.
  while (pdTRUE != xQueueSend(mbox, &msg, SYS_ARCH_BLOCKING_TICKTIMEOUT))
  {
    vTaskDelay(10);    // DT 08/03/2011 Debug
    gusCntMBoxFull++;  // DT 03/03/2011 Debug
  }
  gusCntMBoxFull = 0;  // DT 03/03/2011 Debug
}

 

In the normal case gusCntMBoxFull is supposed to stay 0. If the tcpip thread is locked (it is the only one that pops the queue), the queue keeps being filled until it is full, and that while loop becomes an infinite loop.
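
As a diagnostic (not a fix), the retry loop could be bounded so that a permanently full mailbox is reported instead of spinning silently; a sketch, with an arbitrarily chosen threshold:

void sys_mbox_post(sys_mbox_t mbox, void *msg)
{
  int retries = 0;
  while (pdTRUE != xQueueSend(mbox, &msg, SYS_ARCH_BLOCKING_TICKTIMEOUT))
  {
    vTaskDelay(10);          /* 10 ticks between attempts, as above */
    if (++retries > 1000) {  /* roughly 10 s at a 1 ms tick */
      /* the consumer is almost certainly dead: report instead of spinning */
      LWIP_ASSERT("sys_mbox_post: mailbox stuck full", 0);
      retries = 0;
    }
  }
}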

 

Any ideas? Do you think these two problems are the same problem with two different effects? Consider that this problem also happens in the same situation: web server and portal both active.

One last piece of information: I compile with optimization -O1. I am going to try -O0, but I would have to remove pieces of code, so it is not a simple job.

 

Best regards

Davide

 

 

