coreutils

RE: Multithreaded sort hangs on Solaris


From: McFarland, Jeffrey
Subject: RE: Multithreaded sort hangs on Solaris
Date: Tue, 12 Mar 2013 16:22:37 +0000

Honestly the sort command is generated by another script so I'm not sure why 
the `sort -t\n` syntax was chosen.  However that part seems to work as it is.  
It breaks on newlines as it should.
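
One caveat on the `-t\n` point: the pstack output quoted below shows the process running as `sort -tn`, which suggests the shell removed the backslash before sort ever saw it, so the field separator in effect would be the letter `n` rather than a newline. A minimal sketch of that shell behavior follows, assuming a POSIX-style shell; the variable names are only for illustration and are not taken from the generating script.

```shell
# An unquoted backslash is consumed by the shell, so the program receives "-tn".
arg_unquoted=$(printf '%s' -t\n)

# Inside single quotes the backslash survives, but the argument is then the
# two characters '\' and 'n', still not a real newline.
arg_quoted=$(printf '%s' '-t\n')

printf 'unquoted: %s\n' "$arg_unquoted"
printf 'quoted:   %s\n' "$arg_quoted"
```

Passing an actual newline needs something like the `$'\n'` form suggested in the quoted reply (in shells that support it), or a variable that holds a literal newline.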

Where are you suggesting adding some sleeps?  I haven't gotten into the sort 
code and I'm not sure that I'll have a lot more time to put into it.

I have noticed a couple more related oddities.  First, I found that even 
though I set the batch-size to 100, it always creates 104 files when parallel is 
not set to 1.  It creates 103 files of the same size, then starts merging 
them into the 104th file, then finally into the final file.  When parallel is 
set to 1, it creates only 95 temp files.  Second, I have tested this on 3 
machines now (all with the same OS) and I've noticed up to a 15% increase in 
performance when running with parallel set to 1.

-----Original Message-----
From: Pádraig Brady [mailto:address@hidden]
Sent: Tuesday, March 12, 2013 5:07 AM
To: McFarland, Jeffrey
Cc: address@hidden
Subject: Re: Multithreaded sort hangs on Solaris

On 03/11/2013 03:47 PM, McFarland, Jeffrey wrote:
> I have come across some odd results regarding the sort utility in coreutils 
> version 8.20.  I've looked through the archives and don't see any similar 
> issues so it may be something specific to our systems.
>
>
>
> System:  SunOS 5.10 Generic_147440-26 sun4u sparc SUNW,Sun-Fire-V890
>
>
>
> Issue:  When running sort on a 22.5 GB file I found that about 1 out of 10 
> times the process seems to hang (out of 100+ tests).  The process is still 
> running but the temp files are no longer changing and the final file either 
> has not been created or is a 0 byte file.  When this happens the temp files 
> are never in the exact same state as a previous run.  On this machine a 
> complete sort normally takes about 20 minutes.  On one occasion the process 
> hung for over 48 hours before I killed it.  Running top shows no significant 
> load on the system.
>
>
>
> Command run:
>
> ./sort -t\n -S 256M --batch-size=100 -T /disk/craiwk01/prod/SORTWK -T
> /disk/craiwk02/prod/SORTWK -T /disk/craiwk03/prod/SORTWK -T
> /disk/craiwk04/prod/SORTWK -T /disk/craiwk06/prod/SORTWK -k1.1,1.10
> infile -o infile.sorted
>
>
>
>>: ps
>
>    PID TTY         TIME CMD
>
> 16328 pts/3       5:06 sort
>
>         12697 pts/3       0:00 ps
>
>
>
>>: sudo truss -rall -wall -f -p 16328
>
> 16328:  lwp_park(0x00000000, 0)         (sleeping...)
>
>
>
>>: sudo pstack 16328
>
> 16328:  /usr/local/abacus/etsort/sort -tn -S 295063 --batch-size=100
> -T /disk/
>
> -----------------  lwp# 1 / thread# 1  --------------------
>
> ffffffff7d4d8818 lwp_park (0, 0, 0)
>
> 0000000100009c74 sortlines (111b56580, 111c56080, ffffffff7fffeab0,
> 10012a321, ffffffff7fffead0, 10012a328) + 514
>
> 000000010000a5cc sortlines (111558380, 2, ffffffff7fffeab0, 1121765e0,
> 0, ffffffff7fffeab0) + e6c
>
> 000000010000a5cc sortlines (111956f80, 4, ffffffff7fffeab0, 112176420,
> 0, ffffffff7fffeab0) + e6c
>
> 000000010000a5cc sortlines (112154760, 8, ffffffff7fffeab0, 1121760a0,
> 1, ffffffff7fffeab0) + e6c
>
> 000000010000c070 sort (10012a740, 0, ffffffff7fffead0, 23, 10012cddd,
> 112154760) + 350
>
> 000000010000e6e8 main (13, ffffffff7ffff148, 0, 10012c220, fffd,
> 10012b1e0) + 1ee8
>
> 00000001000041bc _start (0, 0, 0, 0, 0, 0) + 7c
>
> -----------------  lwp# 240 / thread# 240  --------------------
>
> 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
>
>         ** zombie (exited, not detached, not yet joined) **
>
> -----------------  lwp# 241 / thread# 241  --------------------
>
> 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
>
>         ** zombie (exited, not detached, not yet joined) **
>
> -----------------  lwp# 242 / thread# 242  --------------------
>
> 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
>
>         ** zombie (exited, not detached, not yet joined) **
>
>
>
> If I change the sort to run as a single threaded process (add "--parallel=1" 
> to above command) then it doesn't hang.  This makes me think that it's most 
> likely a threading issue.  I ran the same tests on a LINUX machine and it did 
> not have the same hanging issue so it's most likely only an issue with 
> Solaris.
>
>
>
> I initially found this issue using coreutils 8.9 and I changed to 8.20 to see 
> if a fix had been made but no luck.
>
>
>
> Is this a known issue?  Are there any additional tests I should run to 
> further narrow down this issue?

I can't think of anything TBH.
There may possibly be some portability issues with --compress and --parallel 
(due to possibly non async safe functions being called after a fork), but 
you're not using --compress, so we can discount that at least.

No matter if the bug is in coreutils or solaris, adding some sleeps may help 
trigger a race more quickly?

BTW the `sort -t\n` looks strange. Did you mean: sort -t$'\n' ?

thanks,
Pádraig.
