bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#10877: Wimpy external files.


From: Rogier Wolff
Subject: bug#10877: Wimpy external files.
Date: Sat, 25 Feb 2012 13:56:42 +0100
User-agent: Mutt/1.5.20 (2009-06-14)

On Fri, Feb 24, 2012 at 09:28:02AM -0800, Paul Eggert wrote:
> I think 'sort' by default uses 1/8 of physical memory.
> It sounds like that calculation isn't working on your host;
> if so, it'd be nice if you could dive into the code and see why.


OK. I've diven in to the code a bit and after some more
instrumentation I get:

assurancetourix:~/sort/sortproblem/coreutils-8.5/src> tail -n +10 du.unsorted | 
./sort
sort_buffer_size: bound=0M
unknown filesize sortsize=0M
unknown filesize guessing=1M
default sort size: avail=217.234375M, total=8001.207031M, mem=1000.150879M
size is now: 1000M
after rlimit size is now: 1000M
after margin size is now: 500M
after rlimitrss size is now: 500M
returning: 500M
got size_bound=500M / file_size = 1M
worstcase=49M  (worst_case_per_input_byte = 49
returning  size=49M

The core of the problem is that when the filesize cannot be determined
(as in this case because it's a pipe), it guesses the filesize to be
1M, and then uses "worstcase" calculations to come to "about 50M" for
the buffer size.

This happens in the function around line 1380 in sort.c

I don't think that any guessing should be done if we cannot determine
the filesize. In that case we have great heuristics to come up with a
reasonable buffer size without the filesize.

Similarly, there is a logic error in the code that determines the
maximum memory to use: You said it was supposed to use 1/8th of total
memory. However it then takes another factor of two "margin". I think
the margin should not be applied to "total mem/8". (the quick fix is
to just turn that "8" in the code into a "4".)

Similarly, I think that it is reasonable to use "all available"
memory. This "all available" is really free memory, so there should
not be any risk in trying to use it all. Of course there are RSS
limits. Those should apply, with the appropriate margin. 

Here is my suggestion for default_sort_size. I've left the
instrumentation/debug in there. (I don't understand why
SIZE_MAX/1024/1024 evaluates to -1 on my machine, but I don't really
care. It seems to work....).


/* Return the default sort size.  */
static size_t
default_sort_size (void)
{
  /* structure of this function: 
   *
   * We start with too big a number, and then "filter" it as we see
   * fit.
   */ 
     
  double mem, total, avail;
  struct rlimit rlimit;
  size_t size; 

  /* Let SIZE be MEM, but no more than the maximum object size or
     system resource limits.  Avoid the MIN macro here, as it is not
     quite right when only one argument is floating point.  Don't
     bother to check for values like RLIM_INFINITY since in practice
     they are not much less than SIZE_MAX.  */

  size = SIZE_MAX;

  if (getrlimit (RLIMIT_DATA, &rlimit) == 0 && rlimit.rlim_cur < size)
    size = rlimit.rlim_cur;
  fprintf (stderr, "after rlimit size is now: %dM\n", size/1024/1024);

#ifdef RLIMIT_AS
  if (getrlimit (RLIMIT_AS, &rlimit) == 0 && rlimit.rlim_cur < size)
    size = rlimit.rlim_cur;
#endif

  /* Leave a large safety margin for the above limits, as failure can
     occur when they are exceeded.  */
  size /= 2;
  fprintf (stderr, "after margin size is now: %dM\n", size/1024/1024);

#ifdef RLIMIT_RSS
  /* Leave a 1/16 margin for RSS to leave room for code, stack, etc.
     Exceeding RSS is not fatal, but can be quite slow.  */
  if (getrlimit (RLIMIT_RSS, &rlimit) == 0 && rlimit.rlim_cur / 16 * 15 < size)
    size = rlimit.rlim_cur / 16 * 15;
#endif
  fprintf (stderr, "after rlimitrss size is now: %dM\n", size/1024/1024);

  /* Let MEM be available memory or 1/8 of total memory, whichever
     is greater.  */
  avail = physmem_available ();
  total = physmem_total ();
  mem = MAX (avail, total / 8);
  fprintf (stderr, "default sort size: avail=%fM, total=%fM, mem=%fM\n", 
        avail/1024/1024, total/1024/1024, mem/1024/1024); 

  if (mem < size)
    size = mem;
  fprintf (stderr, "size is now: %dM\n", size/1024/1024);

  /* Here is the odd one out: If we've reduced the number too far,
     we'll increase it again here to the minimum value:
     MIN_SORT_SIZE */

  if (MIN_SORT_SIZE > size) 
    size = MIN_SORT_SIZE;

  fprintf (stderr, "returning: %dM\n", MAX (size, MIN_SORT_SIZE)/ 1024/1024);
  /* Use no less than the minimum.  */
  return size;
}


        Roger. 

-- 
** address@hidden ** http://www.BitWizard.nl/ ** +31-15-2600998 **
**    Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233    **
*-- BitWizard writes Linux device drivers for any device you may have! --*
The plan was simple, like my brother-in-law Phil. But unlike
Phil, this plan just might work.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]