findutils-patches

Re: [Findutils-patches] [PATCH] updatedb: run in the C locale, don't do


From: Bernhard Voelker
Subject: Re: [Findutils-patches] [PATCH] updatedb: run in the C locale, don't do case-folding.
Date: Sun, 10 Jan 2016 00:58:21 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

On 01/09/2016 10:18 PM, James Youngman wrote:
> * locate/updatedb.sh: Set LC_ALL to C to avoid unexpected character
> encodings in path names causing sort to fail (idea from Clarence
> Risher).  Don't do case-folding, since the character set in now C,
> which is likely inconsistent with the user's expectations anyway.
> Honour $TMPDIR. Correct the error message you get if you specify
> both --old-format and --dbformat.
> * NEWS: Explain these changes.
> ---
>  NEWS               |  7 +++++++
>  locate/updatedb.sh | 33 ++++++++++++++++++++++++---------
>  2 files changed, 31 insertions(+), 9 deletions(-)
> 
> diff --git a/NEWS b/NEWS
> index f72f021..8865b8e 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -2,6 +2,13 @@ GNU findutils NEWS - User visible changes.      -*- outline -*- (allout)
>  
>  * Major changes in release 4.7.0-git, YYYY-MM-DD
>  
> +** Changes to locate / updatedb
> +
> +The updatedb script now operates in the C locale only.  This means
> +that character encoding issues are now not likely to cause sort to
> +fail.  It also honours the TMPDIR environment variable if that was
> +set, and no longer sorts file names case-insensitively.
> +
>  ** Translations
>  
>  Updated translations: Hungarian, Slovak, Dutch, German.
> diff --git a/locate/updatedb.sh b/locate/updatedb.sh
> index 9cb2811..3861915 100644
> --- a/locate/updatedb.sh
> +++ b/locate/updatedb.sh
> @@ -31,6 +31,19 @@ There is NO WARRANTY, to the extent permitted by law.
>  Written by Eric B. Decker, James Youngman, and Kevin Dalley.
>  '
>  
> +# File path names are not actually text, anyway (since there is no
> +# mechanism to enforce any constraint that the basename of a
> +# subdirectory has the same character encoding as the basename of its
> +# parent).  The practical effect is that, depending on the way a
> +# oarticular system is configured and the content of its filesystem,

s/oarticular/particular/

> +# passing all the file names in the system through "sort" may generate
> +# character encoding errors in text-based tools like "sort".  To avoid
> +# this, we set LC_ALL=C.  This will, presumably, not work perfectly on
> +# systems where LC_ALL is not the way to do locale configuration or
> +# some other seting can override this.
> +LC_ALL=C
> +export LC_ALL
> +

A less invasive change would be to apply LC_ALL=C only to the sort
command.  However, updatedb is usually run from cron, so it would be
surprising for the calling process to have set a different locale
anyway.  I therefore agree with setting it for the whole script.
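
For comparison, the narrower variant could look roughly like the sketch
below.  The file names are made up for illustration; the only point is
the per-command "LC_ALL=C sort" prefix, which leaves the locale of the
rest of the script untouched.

```shell
# Sketch only: limit the C locale to the sort invocation instead of
# exporting LC_ALL for the whole script.  The input names are hypothetical.
sorted=$(printf '%s\n' b.txt A.txt a.txt | LC_ALL=C sort)

# In the C locale sort compares bytes, so upper case comes before lower case.
printf '%s\n' "$sorted"
```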

>  
>  usage="\
>  Usage: $0 [--findoptions='-option1 -option2...']
> @@ -75,7 +88,7 @@ done
>  
>  case "${dbformat:+yes}_${old}" in
>      yes_yes)
> -     echo "The --dbformat and --old cannot both be specified." >&2
> +     echo "The --dbformat and --old-format cannot both be specified." >&2
>       exit 1
>       ;;
>       *)
> @@ -186,12 +199,14 @@ test -z "$PRUNEREGEX" &&
>  : address@hidden@}
>  
>  # Directory to hold intermediate files.
> -if test -d /var/tmp; then
> -  : ${TMPDIR=/var/tmp}
> -elif test -d /usr/tmp; then
> -  : ${TMPDIR=/usr/tmp}
> -else
> -  : ${TMPDIR=/tmp}
> +if test -z "$TMPDIR"; then
> +  if test -d /var/tmp; then
> +    : ${TMPDIR=/var/tmp}
> +  elif test -d /usr/tmp; then
> +    : ${TMPDIR=/usr/tmp}
> +  else
> +    : ${TMPDIR=/tmp}
> +  fi
>  fi
>  export TMPDIR
>  
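
A side note on this hunk, offered as a sketch rather than a requested
change: "test -z" is also true when TMPDIR is set but empty, while the
"${VAR=word}" form only assigns when the variable is unset.  The
"${VAR:=word}" form covers both cases.  The variable name demo_tmpdir
below is hypothetical and is used only to avoid touching the real
TMPDIR.

```shell
# Sketch: "=" versus ":=" when a variable is set but empty.
demo_tmpdir=
: ${demo_tmpdir=/tmp}      # "=" assigns only if the variable is unset
with_equals=$demo_tmpdir   # remains empty

demo_tmpdir=
: ${demo_tmpdir:=/tmp}     # ":=" assigns if the variable is unset or empty
with_colon=$demo_tmpdir    # now /tmp
```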
> @@ -320,7 +335,7 @@ if [ "$myuid" = 0 ]; then
>      exit $?
>    fi
>  fi
> -} | $sort -f | $frcode $frcode_options > $LOCATE_DB.n
> +} | $sort | $frcode $frcode_options > $LOCATE_DB.n
>  then
>      : OK so far
>      true
> @@ -387,7 +402,7 @@ if test -n "$NETPATHS"; then
>      exit $?
>    fi
>  fi
> -} | tr / '\001' | $sort -f | tr '\001' / > "$filelist"
> +} | tr / '\001' | $sort | tr '\001' / > "$filelist"
>  
>  # Compute the (at most 128) most common bigrams in the file list.
>  $bigram $bigram_opts < $filelist | sort | uniq -c | sort -nr |

This removes sort's case-folding option (-f), which is still buggy in
many i18n implementations and may therefore be the cause of this bug.
Do you think the removal is worth mentioning in NEWS, too?
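
To illustrate what users may notice, here is a small sketch of the
ordering change once -f is dropped under LC_ALL=C.  The file names are
invented for the example and are not taken from the bug report.

```shell
# Sketch: ordering with and without case folding in the C locale.
LC_ALL=C
export LC_ALL

names='b.txt
A.txt
a.txt
B.txt'

# With -f, lower case is folded to upper case before comparing,
# so names differing only in case end up next to each other.
folded=$(printf '%s\n' "$names" | sort -f)

# Without -f, comparison is by byte value: all upper-case names
# come first (A.txt, B.txt, a.txt, b.txt).
plain=$(printf '%s\n' "$names" | sort)

printf 'sort -f:\n%s\n\nsort:\n%s\n' "$folded" "$plain"
```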

Thanks & have a nice day,
Berny


