coreutils

Re: Multithreaded sort hangs on Solaris


From: Pádraig Brady
Subject: Re: Multithreaded sort hangs on Solaris
Date: Wed, 13 Mar 2013 17:25:12 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 03/13/2013 02:18 PM, McFarland, Jeffrey wrote:
> Here are the values from another sort that has been running for over 12 hours 
> now.  This time that second argument (number of threads) looks fine in all 
> three cases.  And this time there are no zombie threads.
> 
>> : pstack 20632
> 20632:  /usr/local/abacus/etsort/sort -tn -S 295063 --batch-size=100  -T 
> /disk/
> -----------------  lwp# 1 / thread# 1  --------------------
>  ffffffff7eadc810 lwp_wait (f2, ffffffff7fffea9c)
>  ffffffff7ead4d74 _thrp_join (f2, 0, 0, 1, ffffffff7fffeca0, 
> ffffffff7fffea9c) + 38
>  000000010000f2f4 sortlines (110137e90, 8, 7194a, 11015bfe0, 
> ffffffff7fffeca0, 100136240) + 174
>  0000000100010144 sort (100137cd0, 1, ffffffff7ffff660, 8, ffffffff7fffeeac, 
> ffffffff7ed00200) + 2f0
>  0000000100012bf4 main (13, ffffffff7ffff1f8, ffffffff7ffff298, 100136ca8, 
> 100000000, ffffffff7ed00200) + 21cc
>  0000000100004ca4 _start (0, 0, 0, 0, 0, 0) + 7c
> -----------------  lwp# 242 / thread# 242  --------------------
>  ffffffff7eadc810 lwp_wait (f4, ffffffff7e1fbd2c)
>  ffffffff7ead4d74 _thrp_join (f4, 0, 0, 1, ffffffff7fffeca0, 
> ffffffff7e1fbd2c) + 38
>  000000010000f2f4 sortlines (110137e90, 4, 7194a, 11015c050, 
> ffffffff7fffeca0, 100136240) + 174
>  000000010000f168 sortlines_thread (ffffffff7fffeb60, 1fc000, 0, 0, 
> 10000f104, 0) + 64
>  ffffffff7ead8778 _lwp_start (0, 0, 0, 0, 0, 0)


> -----------------  lwp# 244 / thread# 244  --------------------
>  ffffffff7ead8818 lwp_park (0, 0, 0)
>  000000010000e710 lock_node (11015c360, 10f691fb0, ffffffff7ec4a300, 
> ffffffff7fffecac, ffffffff7ed00a00, 0) + 14
>  000000010000efbc queue_check_insert_parent (ffffffff7fffeca0, 11015c3d0, 
> 100136240, 1101597dd, ffffffff7ed00a00, 1c00) + 2c
>  000000010000f0e8 merge_loop (ffffffff7fffeca0, 7194a, 100136240, 1101597dd, 
> ffffffff7eacff0c, 3) + 90
>  000000010000f43c sortlines (110137e90, 2, 7194a, 11015c0c0, 
> ffffffff7fffeca0, 100136240) + 2bc
>  000000010000f168 sortlines_thread (ffffffff7e1fbdf0, 1fc000, 0, 0, 
> 10000f104, 0) + 64
>  ffffffff7ead8778 _lwp_start (0, 0, 0, 0, 0, 0)

This looks like a deadlock, though since the failure modes vary
it may be triggered by stack corruption.
Would it be possible to annotate lock_node() with the attached patch?
That should at least verify that we're not missing an unlock() somewhere.
You can then capture the annotations by adding '2> locks' at the end of the
command.
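
For readers without the attachment, the kind of annotation meant here can be
sketched roughly as below. This is not the attached diff, only an illustration
under assumptions: struct node is a hypothetical stand-in for sort's merge-tree
node, and trace(), locks_held() and the output format are made up for this
example. The idea is simply to write one line to stderr per lock and unlock so
that the two can be paired up afterwards.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for sort's merge-tree node; only the lock matters. */
struct node
{
  pthread_mutex_t lock;
};

/* Net number of node locks currently held (locks minus unlocks).
   Guarded by its own mutex so the tracing itself is thread-safe. */
static pthread_mutex_t trace_lock = PTHREAD_MUTEX_INITIALIZER;
static long held_count;

/* Write one annotation line to stderr and update the running balance.
   pthread_t is an integer type on Linux and Solaris; the cast below is
   an assumption and is not portable to every platform. */
static void
trace (const char *what, int delta, struct node *n)
{
  pthread_mutex_lock (&trace_lock);
  held_count += delta;
  fprintf (stderr, "%s %p thread=%lu\n", what, (void *) n,
           (unsigned long) pthread_self ());
  pthread_mutex_unlock (&trace_lock);
}

/* Take the node's lock, then record that it was acquired. */
static void
lock_node (struct node *n)
{
  int rc = pthread_mutex_lock (&n->lock);
  assert (rc == 0);
  trace ("lock", 1, n);
}

/* Record the release, then drop the node's lock. */
static void
unlock_node (struct node *n)
{
  trace ("unlock", -1, n);
  int rc = pthread_mutex_unlock (&n->lock);
  assert (rc == 0);
}

/* Return the current lock/unlock balance across all nodes. */
static long
locks_held (void)
{
  pthread_mutex_lock (&trace_lock);
  long n = held_count;
  pthread_mutex_unlock (&trace_lock);
  return n;
}
```

With stderr redirected via '2> locks', every "lock" line for a given node
address should eventually be followed by an "unlock" line for the same
address. An address whose last entry is "lock" when the process hangs is a
candidate for the missing unlock.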

thanks,
Pádraig.

Attachment: sort-lock-annotate.diff
Description: Text Data

