bug#34133: Huge memory usage and output size when using "H" and "G"


From: Assaf Gordon
Subject: bug#34133: Huge memory usage and output size when using "H" and "G"
Date: Sat, 19 Jan 2019 20:37:35 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0

Hello,

On 2019-01-19 7:23 p.m., Hongxu Chen wrote:
> We think the way sed works may suffer from attacks.

Not more than any other program that uses memory or disk
based on its input.

> If the user downloads some sed scripts and runs them *without root privilege*,
> the host machine may soon run out of memory;

First,
root privileges have nothing to do with it.
A non-privileged user can consume as much memory as they want,
and a poorly configured machine will be brought to its knees.

Well configured machines will not allow user-space programs
to cripple them (but I readily admit that such configuration
is not trivial to achieve).
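
As a minimal sketch of one such bound (assuming bash's "ulimit" builtin; the
script name below is made up, and production setups usually rely on cgroups
or systemd resource limits instead):

    # Cap this shell and its children at ~1GiB of virtual memory (value is in KiB).
    ulimit -v 1048576
    # A runaway script now fails to allocate and exits with an error,
    # instead of dragging the whole machine down.
    sed -f untrusted-script.sed input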

Second,
any user who downloads any program from an untrusted source is
expected to know what they're doing.
If they don't - then they can cause a lot of damage.

This has nothing to do with sed.

> in my case, the machine actually hangs and I have to restart it.

Yes, many common installations are not configured to handle
memory exhaustion.

> The problem may be more severe when the machine is hosting some service, or itself
> runs a sed-related service such as text processing (which may be rare), even inside
> some sandbox. The issue may also be triggered unintentionally, and thus cause
> surprise and trouble.

That is all true, but has nothing to do with sed.


>> Any program that keeps the input in memory is vulnerable to unbounded input size
>
> I think the input size is not big; and the size can still be reduced as long as more
> "G;H"s are appended to the script.
> Maybe sed can do some kind of flushing to avoid the memory usage?


I'll rephrase my words:

Your sed program has O(2^N) space requirements.
Any program that has exponential behavior (be it in space or time)
will quickly lead to pathological cases.

So while your input file is not too big (N=30 lines),
it leads to huge memory requirements.
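
To watch that growth with sed itself (a small sketch; I'm assuming your script
repeats "G;H" pairs, as your message above suggests): "H" appends the pattern
space to the hold space, and "G" appends the hold space back to the pattern
space, so every extra pair more than doubles both buffers - and since the hold
space carries over from one input line to the next, the growth compounds across
your 30 lines:

    printf 'aaaa\n' | sed 'G;H' | wc -c          # 6 bytes (GNU sed)
    printf 'aaaa\n' | sed 'G;H;G;H' | wc -c      # 13 bytes
    printf 'aaaa\n' | sed 'G;H;G;H;G;H' | wc -c  # 33 bytes, and so on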

I recommend reading some background about complexity
(most of these deal with time complexity, but it applies to space as well):
  http://bigocheatsheet.com/
  https://en.wikipedia.org/wiki/Time_complexity#Exponential_time
  https://en.wikipedia.org/wiki/Computational_complexity

To illustrate my point further, here are similar examples in awk and
Perl that would choke on your input:

  awk '{ buf = buf $1 ; buf = buf buf } END { print buf }'
  perl -ne '$buf .= $_ ; $buf .= $buf ; print $buf' < input

Just as you wrote above, a non-root user who runs these on your input
will consume a huge amount of memory. This is why I said it has nothing
to do with sed.

To see the unfolding of exponential growth in action,
try the following C program (which goes to show it is not just awk/perl/sed scripts):

    /* Compile with:
          gcc -o 1 1.c -lm

       Run with:
         seq 30 | ./1
    */
    #define _GNU_SOURCE
    #define _POSIX_C_SOURCE 200809L
    #include <stdlib.h>
    #include <stdio.h>
    #include <math.h>

    int main()
    {
      char *line = NULL;     /* getline(3) allocates and grows this buffer */
      size_t n = 0;          /* current size of the getline buffer */
      size_t linenum = 0;
      size_t buf_size = 0;   /* how large an accumulating buffer WOULD become */
      ssize_t i;

      while ( (i = getline(&line, &n, stdin)) != -1) {
         ++linenum;
         /* append this line's buffer, then double it - the same growth pattern */
         buf_size += n;
         buf_size *= 2;
         printf ("line %zu (%zu bytes), 2^%zu == %g, buf_size = %zu bytes\n",
                 linenum, n, linenum, pow(2, linenum), buf_size);
      }
      free (line);
      return 0;
    }


The above does not actually allocate memory; it just shows how much
would be allocated. Run it with your input and see for yourself:

  line 1 (120 bytes), 2^1 == 2, buf_size = 240 bytes
  line 2 (120 bytes), 2^2 == 4, buf_size = 720 bytes
  line 3 (120 bytes), 2^3 == 8, buf_size = 1680 bytes
  [...]
  line 30 (120 bytes), 2^30 == 1.07374e+09, buf_size = 257698037520 bytes

If you tried to do "malloc(buf_size)" it would consume all your memory
(if you have less than 256GB).

----

This applies not just to memory (RAM), but to disk space as well
(which conceptually is just a different type of storage).

Try this example:

    printf a > in
    for i in $(seq 20); do cat in in > out ; mv out in ; done ;

The input file starts with 1 byte.
After 20 rounds, the file size will be 1MB.
If you try 30 rounds, the file will be 1GB.
The "20" and "30" here correspond to your input size (number of lines).
A small change in the input leads to a large change in the output (which is
why "exponential" programs are considered bad).

If you are still not convinced, try 40 rounds - that will attempt
to create a 1TB file. If your disk is smaller - it will become full
(which is just like memory exhaustion).

----

I hope this resolves the issue.

regards,
 - assaf
