bug#34133: Huge memory usage and output size when using "H" and "G"

bug-sed

From:	Assaf Gordon
Subject:	bug#34133: Huge memory usage and output size when using "H" and "G"
Date:	Sat, 19 Jan 2019 14:27:30 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0

tags 34133 notabug
close 34133
stop

Hello,

On 2019-01-19 2:53 a.m., Hongxu Chen wrote:

     We found an issue that are relevant to use of "H" and "G" for appending
hold space and pattern space.


It is an "issue" in the sense that your example does consume large
amounts of memory, but it is not a bug - this is how sed works.

     The input file is attached which is a file of 30 lines and 80 columns
filled with 'a'. And my memory is 64G with equivalent swap.

       # these two may eat up the memory
     sed 's/a/d/; G; H;' input
     sed '/b/d; G; H;' input



Let's simplify:
The "s/a/d/" does not change anything related to memory
(it only changes the first "a" on each line to "d"), so I'll omit it.

The '/b/d' command is a no-op, because your input does not contain
the letter "b".

We're left with:
   sed 'G;H'
The length of each line also doesn't matter, so I'll use shorter lines.

Now observe the following:

$ printf "%s\n" 0 | sed 'G;H' | wc -l
2
$ printf "%s\n" 0 1 | sed 'G;H' | wc -l
6
$ printf "%s\n" 0 1 2 | sed 'G;H' | wc -l
14
$ printf "%s\n" 0 1 2 3 | sed 'G;H' | wc -l
30
$ printf "%s\n" 0 1 2 3 4 | sed 'G;H' | wc -l
62
$ printf "%s\n" 0 1 2 3 4 5 | sed 'G;H' | wc -l
126
$ printf "%s\n" 0 1 2 3 4 5 6 | sed 'G;H' | wc -l
254
$ printf "%s\n" 0 1 2 3 4 5 6 7 | sed 'G;H' | wc -l
510
$ printf "%s\n" 0 1 2 3 4 5 6 7 8 | sed 'G;H' | wc -l
1022
$ printf "%s\n" 0 1 2 3 4 5 6 7 8 9 | sed 'G;H' | wc -l
2046
$ printf "%s\n" 0 1 2 3 4 5 6 7 8 9 10 | sed 'G;H' | wc -l
4094
$ printf "%s\n" 0 1 2 3 4 5 6 7 8 9 10 11 | sed 'G;H' | wc -l
8190
$ printf "%s\n" 0 1 2 3 4 5 6 7 8 9 10 11 12 | sed 'G;H' | wc -l
16382

Notice the trend?
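The number of output lines (and by proxy the size of the buffers and the
memory usage) grows exponentially with the number of input lines:
each additional input line roughly doubles it.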
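
The counts above fit the closed form 2^(n+1) - 2 for n input lines. That
formula is inferred from the numbers shown here, not taken from the sed
documentation. A small sketch to check it (gh_lines is a hypothetical
helper name, and seq is assumed to be available):

```shell
#!/bin/sh
# gh_lines N: number of output lines of sed 'G;H' for N input lines.
# (Hypothetical helper for illustration; the content of the lines is irrelevant.)
gh_lines() {
    count=$(seq "$1" | sed 'G;H' | wc -l)
    # Arithmetic expansion strips any padding some wc implementations add.
    echo $((count))
}

# Compare the measured count with 2^(n+1) - 2 for small n only;
# do not raise the limit much, since the output size doubles each time.
n=1
while [ "$n" -le 10 ]; do
    expected=$(( (1 << (n + 1)) - 2 ))
    echo "$n $(gh_lines "$n") $expected"
    n=$((n + 1))
done
```

Each printed row should show the same value in the second and third column.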

With 20 lines, you'll need to keep on the order of 2^20 (about 1M) lines
in memory (plus the size of each line, pointer overhead, etc.).
Still doable.

With 30 lines, you'll need O(2^30) = 1G of lines.
If each of your lines is 80 characters, you'll need 80GB (before
counting overhead of pointers).
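
The estimate is plain arithmetic and can be reproduced in the shell. This is
only a rough upper bound under the same assumptions as above (about 2^n
buffered lines, every line counted at full width, no overhead);
estimate_mib is a hypothetical helper, and the shell is assumed to have
64-bit arithmetic:

```shell
#!/bin/sh
# estimate_mib LINES WIDTH: rough buffer size in MiB, assuming about
# 2^LINES buffered lines of WIDTH bytes each (hypothetical helper).
estimate_mib() {
    lines=$1
    width=$2
    bytes=$(( (1 << lines) * width ))
    echo $(( bytes >> 20 ))
}

estimate_mib 20 80   # prints 80     (about 80 MiB)
estimate_mib 30 80   # prints 81920  (about 80 GiB)
```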

      # this is fine
     sed '/a/d; G; H;' input


This is "fine" because the "/a/d" command deletes all lines of your
input, hence nothing is stored in the pattern/hold buffers.
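
A quick way to see this, as a sketch with made-up input lines: "d" deletes
the pattern space and starts the next cycle, so G and H never run for a
line that matches /a/.

```shell
#!/bin/sh
# Every line contains "a": all lines are deleted before G or H run.
printf '%s\n' aaa aaa aaa | sed '/a/d; G; H' | wc -l    # prints 0

# One line without "a": only that line reaches G and H, and it is
# printed followed by the (still empty) hold space, i.e. 2 lines.
printf '%s\n' aaa xyz aaa | sed '/a/d; G; H' | wc -l    # prints 2
```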

     I learned from http://www.grymoire.com/Unix/Sed.html that 'G' appends
hold space to pattern space, and 'H' does the inverse.
     In the first two examples, the buffer of hold space will be appended to
pattern space, and subsequently content of pattern space will be appended
to hold space once more. With one more input line, the two buffers will be
doubled; and as long as the input file is big enough, sed may finally eat
up the memory and populate the output.


Yes, that's how it works.
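
To watch the doubling happen, here is a small sketch using three made-up
input lines. With -n the normal output is suppressed, and the "l" command
prints the pattern space unambiguously at the end of each cycle (newlines
shown as \n, end of buffer as $):

```shell
#!/bin/sh
# Show the pattern space after G;H for each input line.
printf '%s\n' a b c | sed -n 'G;H;l'
# Output:
#   a\n$
#   b\n\na\n$
#   c\n\na\n\nb\n\na\n$
```

The pattern space holds 2, 4 and then 8 lines: each cycle appends the whole
hold space, which itself already contains every earlier pattern space.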

     We think this is vulnerable since it may eat up the memory in a few
seconds.


Any program that keeps the input in memory is vulnerable
to unbounded input size. That is not a bug.

As such, I'm closing this as "not a bug", but discussion can continue
by replying to this thread.

regards,
 - assaf
