bug-gnu-utils

Re: Grep -quiet not working with -file


From: Stepan Kasal
Subject: Re: Grep -quiet not working with -file
Date: Wed, 28 Jan 2004 12:08:07 +0100
User-agent: Mutt/1.4.1i

Hello,

On Tue, Jan 27, 2004 at 07:41:34PM -0000, Mark Palmer wrote:
> The -q option does not seem to improve the speed when it's used in
> conjunction with the -f option.
> 
> I guessing that it still goes round checking all instances of the patterns
> supplied in the file, instead of exiting at the first match.

I guess I see where the problem is, but it's not trivial to explain,
so please be patient, even though the following might seem too long:

When grep gets more than one pattern, either with multiple -e options
or with -f option, it ``learns them all''.

Thus calling
        grep -e one -e two -e three
is roughly equivalent to
        grep -E 'one|two|three'
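
A minimal sketch to check this equivalence on a small input. The file
name sample.txt and its contents are made up for the illustration, and
the alternation form is run with -E so that | acts as an operator
rather than a literal character:

```shell
# Hypothetical sample input, created only for this illustration.
printf 'one fish\nred fish\nthree fish\n' > sample.txt

# Several -e options: a line is selected if any one pattern matches.
multi=$(grep -e one -e two -e three sample.txt)

# One extended regex with alternation (-E turns | into an operator).
alt=$(grep -E 'one|two|three' sample.txt)

# Both forms should select the same lines.
if [ "$multi" = "$alt" ]; then
    echo "same lines selected"
fi
```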

In both cases the ``learning'' of the patterns means that grep builds
an internal structure that represents the pattern.   This internal
data structure is built before grep reads any input file, but it's
constructed so that the search through data is as quick as possible.

The idea behind it is that the input files are usually much bigger
than the pattern lists.

And this may be the thing you observe: even though -q is given,
grep starts with taking ``the full armor'', and then proceeds to
the input text (where it exits on first match, because of -q).

In other words, big pattern lists slow down the startup of grep.

To improve performance, you may try splitting the pattern file into
several smaller ones and calling grep like this:
  grep -qf file1 infile ||
  grep -qf file2 infile ||
  ...

(And you should put the patterns which are more likely to appear
into file1.)
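
The chain of greps above could be automated roughly as follows. This is
only a sketch: the names patterns.txt, infile and the chunk. prefix are
made up, and the chunk size of 2 lines is chosen just to keep the example
small. Since split keeps the original order, the most likely patterns
should be at the top of patterns.txt:

```shell
# Hypothetical files, created only for this illustration.
printf '%s\n' alpha beta gamma delta epsilon zeta > patterns.txt
printf 'nothing here\nsome zeta text\n' > infile

# Split the pattern list into pieces of 2 lines each:
# chunk.aa, chunk.ab, chunk.ac
split -l 2 patterns.txt chunk.

# Try the pieces in order and stop at the first one that matches.
found=no
matched_chunk=none
for f in chunk.*; do
    if grep -qf "$f" infile; then
        found=yes
        matched_chunk=$f
        break
    fi
done
echo "found=$found in $matched_chunk"
```

With real data the chunk size would of course be much larger; the best
value depends on the patterns and has to be found by experiment.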

Or you can try to use one big regex instead of multiple patterns.
The pattern file would thus contain one long line (to be used with
grep -E, so that | means alternation):
        one|two|three|...|ninethousand

(Though I said above that the two forms are "roughly equivalent",
you may experience a difference in performance.)
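
One possible way to produce such a one-line file from an existing
pattern file is paste. This is a sketch under the assumption that the
patterns contain no characters that mean something different in
extended regexes (such as +, ?, | or parentheses); the file names are
made up:

```shell
# Hypothetical files, created only for this illustration.
printf '%s\n' one two three > patterns.txt
printf 'zero\nthree\n' > infile

# Join all lines of the pattern file into one line, separated by |.
paste -s -d '|' patterns.txt > combined.txt
cat combined.txt

# Use the single long pattern as an extended regex.
grep -E -qf combined.txt infile
status=$?
echo "exit status: $status"
```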

Or you may combine the two tricks together.

There are also other performance issues; for example it could help
to set LC_ALL=C to avoid UTF-8, but that would drag me too far...
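
For completeness, a sketch of how the locale could be set for a single
grep call only. It assumes that both the patterns and the data are plain
ASCII (or that byte-wise matching is acceptable); the file names are
again made up:

```shell
# Hypothetical files, created only for this illustration.
printf '%s\n' one two three > patterns.txt
printf 'zero\nthree\n' > infile

# LC_ALL is set for this one command only; the shell keeps its locale.
LC_ALL=C grep -qf patterns.txt infile
status=$?
echo "exit status: $status"
```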

Hope this helps,
        Stepan Kasal



