bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#9086: ls --color: 30% speed-up and case-insensitive suffixes [Re: bu


From: Jim Meyering
Subject: bug#9086: ls --color: 30% speed-up and case-insensitive suffixes [Re: bug#9086
Date: Tue, 09 Oct 2012 13:52:37 +0200

Eric Blake wrote:
> On 07/26/2011 02:33 PM, marcel partap wrote:
>> Here's a patch. Adds STRCASEEQ_LEN macro for case insensitive extension
>> matching.
>> #regards/marcel.
>
> Your patch would make the new behavior unconditional.  But I like case
> sensitivity, and think that case insensitivity should be an opt-in
> process that I request, with coordination between dircolors to
> generate a new string for LS_COLORS to be honored by ls.  Furthermore,
> the patch is lacking in NEWS, documentation, and testsuite coverage.
>
> Additionally, you should be aware that strncasecmp() has undefined
> behavior in non-C multibyte locales.  It would probably be better to
> use c_strncasecmp(), so that you are guaranteed defined behavior
> regardless of the current locale.
>
> Would you care to tackle those additional issues?  And are you set up
> for copyright assignment, since the patch will probably be non-trivial
> by that point in time?

I saw this while going back through old bugs, so started poking around.
How about this: if a suffix is not matched, convert it to lower case
and check again.  That way, anyone who cares to highlight .TaR or .TAR
differently from .tar can still easily do so, yet names ending with
uppercase .TAR, .JPG, .FLAC, etc. will now be highlighted by default.

Does anyone think it's worth making this new fallback-to-case-insensitivity
an option?

While looking at that, I realized that comparing each name in an ls'd
directory with each of nearly 100 suffixes should leave nontrivial room
for improvement.  Sure enough, once I'd replaced that linear search
with a hash-table lookup, I see a measurable improvement.

This is on a tmpfs file system, with 100,000 .c files in one directory,
created like this:

  seq --f %g.c 100000|xargs touch

Comparing old to new:

  $ for i in $(seq 10); do env time --f=%e ls --color=always > /t/old; done
  0.35
  0.34
  0.48
  0.36
  0.35
  0.35
  0.35
  0.35
  0.36
  0.35

  $ for i in $(seq 10);do env time --f=%e /cu/src/ls --color=always >/t/new;done
  0.24
  0.24
  0.23
  0.23
  0.23
  0.33
  0.23
  0.23
  0.24
  0.23

and taking best-of-10 times, I see a 0.34 -> 0.23 (~30%) improvement.
Of course, I have deliberately used the case that shows the most improvement:
  - a directory with very many files
  - no suffix match, so the old code searches all suffixes
  - using tmpfs: minimal stat overhead

In general, the improvement will be smaller.

The first patch is the O(100-strncmp) -> O(hash-lookup) speed-up.
The second adds the case-insensitive fallback.


>From be82e6866e35320a3c228d7c8af5416a31c6a14e Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Mon, 8 Oct 2012 19:06:18 +0200
Subject: [PATCH 1/2] ls: --color: speed up suffix-coloring

Prior to this change, when coloring a file name based on its suffix,
and with default dircolors-generated LS_COLORS, ls would compare the
file name to each of nearly 100 suffixes.  Convert that process to
use a hash-table lookup.

On a tmpfs file system, with 100,000 .c files in one directory,
created via "seq --f %g.c 100000|xargs touch", compare old to new:

  $ for i in $(seq 10);do env time --f=%e ls --col=always >/t/old; done
  ...
  0.34
  $ for i in $(seq 10);do env time --f=%e /t/ls --col=al >/t/new; done
  ...
  0.23

Taking best-of-10 times, that's a 0.34 -> 0.23 (~30%) improvement.

* src/ls.c: Include "hash-pjw.h".
(color_ext_type): Remove global.
(suffix_map, struct sufco): Declare.
(sufco_hasher, sufco_comparator, suffix_color_init): New functions.
(suffix_color_insert, suffix_color_lookup): New functions.
(main): Call suffix_color_init.
(parse_ls_color): Rather than creating a linked list of
suffix/color-code pairs, insert each pair into our new suffix_map.
(print_color_indicator): Use map-lookup, rather than linear search.
* NEWS (Improvements): Mention it.
---
 NEWS     |   2 +
 src/ls.c | 157 +++++++++++++++++++++++++++++++++++++++++++--------------------
 2 files changed, 110 insertions(+), 49 deletions(-)

diff --git a/NEWS b/NEWS
index aff5bf1..16d7dc8 100644
--- a/NEWS
+++ b/NEWS
@@ -57,6 +57,8 @@ GNU coreutils NEWS                                    -*- 
outline -*-
   deterministic primality test for each prime factor, not just a
   probabilistic test.

+  ls --color is as much as 30% faster
+
   seq is now up to 70 times faster than it was in coreutils-8.19 and prior,
   but only with non-negative whole numbers, an increment of 1, and no
   format-changing options.
diff --git a/src/ls.c b/src/ls.c
index 106d234..acebd11 100644
--- a/src/ls.c
+++ b/src/ls.c
@@ -91,6 +91,7 @@
 #include "filenamecat.h"
 #include "hard-locale.h"
 #include "hash.h"
+#include "hash-pjw.h"
 #include "human.h"
 #include "filemode.h"
 #include "filevercmp.h"
@@ -599,9 +600,6 @@ static struct bin_str color_indicator[] =
     { LEN_STR_PAIR ("\033[K") },       /* cl: clear to end of line */
   };

-/* FIXME: comment  */
-static struct color_ext_type *color_ext_list = NULL;
-
 /* Buffer for color sequences */
 static char *color_buf;

@@ -1018,6 +1016,83 @@ dired_dump_obstack (const char *prefix, struct obstack 
*os)
     }
 }

+/* ********************************************************************* */
+/* Map each suffix from LS_COLORS to its escape code.
+   A filename's suffix is the key, the value is a byte sequence
+   with which to color-highlight the corresponding name.  */
+static Hash_table *suffix_map;
+
+/* Initial size of that table.  */
+enum { INITIAL_SUFFIX_COLOR_TABLE_SIZE = 131 };
+
+/* An entry in the suffix_map table.  */
+struct sufco
+  {
+    char *suffix;              /* file name suffix */
+    struct bin_str color_code; /* terminal codes to elicit desired color */
+  };
+
+static size_t
+sufco_hasher (void const *entry, size_t table_size)
+{
+  struct sufco const *s = entry;
+  return hash_pjw (s->suffix, table_size);
+}
+
+static bool
+sufco_comparator (void const *e1, void const *e2)
+{
+  struct sufco const *s1 = e1;
+  struct sufco const *s2 = e2;
+  return STREQ (s1->suffix, s2->suffix);
+}
+
+/* Initialize a hash table to implement our suffix/color-code map.  */
+static void
+suffix_color_init (void)
+{
+  suffix_map = hash_initialize (INITIAL_SUFFIX_COLOR_TABLE_SIZE, NULL,
+                                sufco_hasher,
+                                sufco_comparator,
+                                NULL);
+  if (suffix_map == NULL)
+    xalloc_die ();
+}
+
+/* Insert S, a malloc'd suffix/color-code pair.  If the specified suffix is
+   initially in the table, change the mapping to refer to the new entry.
+   Either insert S or, when its suffix is a duplicate, record this new
+   color code in the table and free S's memory.  */
+static bool
+suffix_color_insert (struct sufco *s)
+{
+  struct sufco *match = NULL;
+  if (hash_insert_if_absent (suffix_map, s, (const void **) &match) < 0)
+    return false;
+
+  if (match)
+    {
+      /* This suffix is already in the table.
+         Incoming color code takes precedence.  */
+      match->color_code.len = s->color_code.len;
+      match->color_code.string = s->color_code.string;
+      free (s);
+    }
+
+  return true;
+}
+
+/* Map a suffix string from LS_COLORS to its color code bytes.  */
+static struct sufco *
+suffix_color_lookup (char const *suffix)
+{
+  struct sufco s;
+  s.suffix = (char *) suffix;
+  return hash_lookup (suffix_map, &s);
+}
+
+/* ********************************************************************* */
+
 /* Read the abbreviated month names from the locale, to align them
    and to determine the max width of the field and to truncate names
    greater than our max allowed.
@@ -1287,7 +1362,10 @@ main (int argc, char **argv)
   i = decode_switches (argc, argv);

   if (print_with_color)
-    parse_ls_color ();
+    {
+      suffix_color_init ();
+      parse_ls_color ();
+    }

   /* Test print_with_color again, because the call to parse_ls_color
      may have just reset it -- e.g., if LS_COLORS is invalid.  */
@@ -2100,7 +2178,7 @@ decode_switches (int argc, char **argv)
 }

 /* Parse a string as part of the LS_COLORS variable; this may involve
-   decoding all kinds of escape characters.  If equals_end is set an
+   decoding all kinds of escape characters.  If equals_end is set, an
    unescaped equal sign ends the string, otherwise only a : or \0
    does.  Set *OUTPUT_COUNT to the number of bytes output.  Return
    true if successful.
@@ -2324,7 +2402,7 @@ parse_ls_color (void)
   char *buf;                   /* color_buf buffer pointer */
   int ind_no;                  /* Indicator number */
   char label[3];               /* Indicator label */
-  struct color_ext_type *ext;  /* Extension we are working on */
+  struct sufco *ext;           /* suffix/color-code we are working on */

   if ((p = getenv ("LS_COLORS")) == NULL || *p == '\0')
     return;
@@ -2351,20 +2429,18 @@ parse_ls_color (void)
               break;

             case '*':
-              /* Allocate new extension block and add to head of
-                 linked list (this way a later definition will
-                 override an earlier one, which can be useful for
-                 having terminal-specific defs override global).  */
-
-              ext = xmalloc (sizeof *ext);
-              ext->next = color_ext_list;
-              color_ext_list = ext;
+              {
+                ext = xmalloc (sizeof *ext);
+                ext->suffix = buf;

-              ++p;
-              ext->ext.string = buf;
+                ++p;
+                size_t suff_len;
+                state = (get_funky_string (&buf, &p, true, &suff_len)
+                         ? PS_4 : PS_FAIL);

-              state = (get_funky_string (&buf, &p, true, &ext->ext.len)
-                       ? PS_4 : PS_FAIL);
+                /* Insert '\0', to NUL-terminate ext->suffix.  */
+                *buf++ = 0;
+              }
               break;

             case '\0':
@@ -2411,9 +2487,13 @@ parse_ls_color (void)
         case PS_4:             /* Equal sign after *.ext */
           if (*(p++) == '=')
             {
-              ext->seq.string = buf;
-              state = (get_funky_string (&buf, &p, false, &ext->seq.len)
+              ext->color_code.string = buf;
+              state = (get_funky_string (&buf, &p, false, &ext->color_code.len)
                        ? PS_START : PS_FAIL);
+
+              if ( ! suffix_color_insert (ext))
+                goto done;
+              ext = NULL;
             }
           else
             state = PS_FAIL;
@@ -2430,18 +2510,10 @@ parse_ls_color (void)

   if (state == PS_FAIL)
     {
-      struct color_ext_type *e;
-      struct color_ext_type *e2;
-
       error (0, 0,
              _("unparsable value for LS_COLORS environment variable"));
       free (color_buf);
-      for (e = color_ext_list; e != NULL; /* empty */)
-        {
-          e2 = e;
-          e = e->next;
-          free (e2);
-        }
+      free (ext);
       print_with_color = false;
     }

@@ -4265,10 +4337,6 @@ print_type_indicator (bool stat_ok, mode_t mode, enum 
filetype type)
 static bool
 print_color_indicator (const struct fileinfo *f, bool symlink_target)
 {
-  enum indicator_no type;
-  struct color_ext_type *ext;  /* Color extension */
-  size_t len;                  /* Length of name */
-
   const char* name;
   mode_t mode;
   int linkok;
@@ -4287,6 +4355,7 @@ print_color_indicator (const struct fileinfo *f, bool 
symlink_target)

   /* Is this a nonexistent file?  If so, linkok == -1.  */

+  enum indicator_no type;
   if (linkok == -1 && is_colored (C_MISSING))
     type = C_MISSING;
   else if (!f->stat_ok)
@@ -4343,21 +4412,11 @@ print_color_indicator (const struct fileinfo *f, bool 
symlink_target)
     }

   /* Check the file's suffix only if still classified as C_FILE.  */
-  ext = NULL;
-  if (type == C_FILE)
-    {
-      /* Test if NAME has a recognized suffix.  */
-
-      len = strlen (name);
-      name += len;             /* Pointer to final \0.  */
-      for (ext = color_ext_list; ext != NULL; ext = ext->next)
-        {
-          if (ext->ext.len <= len
-              && STREQ_LEN (name - ext->ext.len, ext->ext.string,
-                            ext->ext.len))
-            break;
-        }
-    }
+  char *suffix;
+  struct sufco *ext;   /* Color extension */
+  ext = (type == C_FILE && (suffix = strrchr (name, '.'))
+         ? suffix_color_lookup (suffix)
+         : NULL);

   /* Adjust the color for orphaned symlinks.  */
   if (type == C_LINK && !linkok)
@@ -4368,7 +4427,7 @@ print_color_indicator (const struct fileinfo *f, bool 
symlink_target)

   {
     const struct bin_str *const s
-      = ext ? &(ext->seq) : &color_indicator[type];
+      = ext ? &(ext->color_code) : &color_indicator[type];
     if (s->string != NULL)
       {
         /* Need to reset so not dealing with attribute combinations */
--
1.8.0.rc1


>From 30a762cacf2031ac4431b427a195c9f535c745e2 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Mon, 8 Oct 2012 10:05:32 +0200
Subject: [PATCH 2/2] ls: --color suffix-match: fall back to case-insensitive
 look-up

* src/ls.c: FIXME
* tests/ls/color-icase.sh: Test the change.
* tests/local.mk (all_tests): Add it.
* NEWS (Changes in behavior): Mention it.
Suggested by SciFi in http://bugs.gnu.org/9086
---
 NEWS                    |  5 +++++
 src/ls.c                | 27 ++++++++++++++++++++++++++-
 tests/local.mk          |  1 +
 tests/ls/color-icase.sh | 31 +++++++++++++++++++++++++++++++
 4 files changed, 63 insertions(+), 1 deletion(-)
 create mode 100755 tests/ls/color-icase.sh

diff --git a/NEWS b/NEWS
index 16d7dc8..8364fd4 100644
--- a/NEWS
+++ b/NEWS
@@ -46,6 +46,11 @@ GNU coreutils NEWS                                    -*- 
outline -*-

 ** Changes in behavior

+  ls --color's suffix-matching code now uses a case-insensitive fall-back,
+  which means a file named e.g., "PKG.TAR" is now colored red, just like
+  "pkg.tar", but if you prefer to highlight names ending in .TAR differently
+  from those ending in .tar you may.
+
   nproc now diagnoses with an error, non option command line parameters.

 ** Improvements
diff --git a/src/ls.c b/src/ls.c
index acebd11..3a1ba3c 100644
--- a/src/ls.c
+++ b/src/ls.c
@@ -86,6 +86,7 @@

 #include "acl.h"
 #include "argmatch.h"
+#include "c-ctype.h"
 #include "dev-ino.h"
 #include "error.h"
 #include "filenamecat.h"
@@ -1088,7 +1089,31 @@ suffix_color_lookup (char const *suffix)
 {
   struct sufco s;
   s.suffix = (char *) suffix;
-  return hash_lookup (suffix_map, &s);
+  struct sufco *e = hash_lookup (suffix_map, &s);
+  if (e)
+    return e;
+
+  size_t s_len = strlen (suffix);
+  /* If there is no uppercase byte, then stop here.  */
+  if (strcspn (suffix, "ABCDEFGHIJKLMNOPQRSTUVWXYZ") == s_len)
+    return NULL;
+
+  char *lower_case = malloc (s_len + 1);
+  if (lower_case == NULL)
+    return NULL;
+  char *p = lower_case;
+  while (*suffix)
+    {
+      *p++ = c_tolower (*suffix);
+      suffix++;
+    }
+  *p = 0;
+
+  s.suffix = lower_case;
+  e = hash_lookup (suffix_map, &s);
+  free (lower_case);
+
+  return e;
 }

 /* ********************************************************************* */
diff --git a/tests/local.mk b/tests/local.mk
index 486bf31..12c0ac2 100644
--- a/tests/local.mk
+++ b/tests/local.mk
@@ -516,6 +516,7 @@ all_tests =                                 \
   tests/ls/block-size.sh                       \
   tests/ls/color-clear-to-eol.sh               \
   tests/ls/color-dtype-dir.sh                  \
+  tests/ls/color-icase.sh                      \
   tests/ls/color-norm.sh                       \
   tests/ls/dangle.sh                           \
   tests/ls/dired.sh                            \
diff --git a/tests/ls/color-icase.sh b/tests/ls/color-icase.sh
new file mode 100755
index 0000000..726dee5
--- /dev/null
+++ b/tests/ls/color-icase.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+# after 8.19, ls --color recognized suffixes regardless of case
+
+# Copyright (C) 2012 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/tests/init.sh"; path_prepend_ ./src
+print_ver_ ls
+
+color_code='01;31'
+
+echo x > red.TAR || framework_failure_
+printf '\e[0m\e['"$color_code"'mred.TAR\e[0m\n' > exp || fail=1
+
+LS_COLORS="*.tar=$color_code" ls --color=always red.TAR > out || fail=1
+
+compare exp out || fail=1
+
+Exit $fail
--
1.8.0.rc1





reply via email to

[Prev in Thread] Current Thread [Next in Thread]