bug-coreutils

Re: Degraded performance in cat + patch


From: Pádraig Brady
Subject: Re: Degraded performance in cat + patch
Date: Fri, 6 Mar 2009 10:15:43 +0000
User-agent: Thunderbird 2.0.0.6 (X11/20071008)

Tzvi Rotshtein wrote:
> Hi,
> I've been using "cat" to feed large files into some data cruncher
> application using something like this:
>    cat my_data | data_cruncher

Well, ideally you should just run `data_cruncher < my_data` and
let the operating system handle getting the data from disk
in the most efficient way possible.
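
As a rough sketch of the difference, here are both forms side by side.
`wc -l` stands in for data_cruncher (an assumption; any program that
reads stdin would do) and the sample file is a throwaway created just
for the illustration:

```shell
#!/bin/sh
# Throwaway sample file standing in for my_data (hypothetical contents).
my_data=$(mktemp)
printf 'one\ntwo\nthree\n' > "$my_data"

# Pipeline form: cat reads the file and copies every byte into a pipe.
via_pipe=$(cat "$my_data" | wc -l)

# Redirect form: the consumer reads the file directly on its own stdin,
# with no extra process and no extra copy through a pipe.
via_redirect=$(wc -l < "$my_data")

echo "pipe: $via_pipe lines, redirect: $via_redirect lines"
rm -f "$my_data"
```

Both forms hand the consumer the same bytes; only the path the data
takes differs.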

> However, cat was reading/writing the file at sub-optimal speeds (not even
> half as fast as the disk & OS can provide it). I traced this to the buffer
> size selection algorithm in "cat", which generally provides a good balance
> with a low memory footprint but constrains cat from reaching the disk's (or
> OS cache's) peak performance.

There are lots of variables involved here: disk elevators, filesystems,
various caches, SSDs, available memory, ...

> While it is usually not crucial for most applications to have "cat"
> operating at peak performance, I thought it would be useful to let the user
> determine that.

> $ time ./cat test_sample_150mb_file.txt > /dev/null
> 0.00user 0.54system 0:00.59elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+186minor)pagefaults 0swaps

Note that this is reading from cache at about 254MB/s (150MB in 0.59s).
I'm guessing that part of the file was read from disk.

> $ time ./cat -r 1048576 test_sample_150mb_file.txt > /dev/null
> 0.00user 0.09system 0:00.12elapsed 73%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+444minor)pagefaults 0swaps

Note that this is reading from cache at 1.25GB/s (150MB in 0.12s).
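
Both rates are just the file size divided by the elapsed time reported
by `time`. A quick check, assuming the file is exactly 150MB as its name
suggests (the `throughput` helper is made up for this illustration):

```shell
#!/bin/sh
# throughput SIZE_MB SECONDS: print MB/s rounded to the nearest integer.
throughput() {
    awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", size / secs }'
}

default_rate=$(throughput 150 0.59)   # default buffer size run
large_rate=$(throughput 150 0.12)     # -r 1048576 run

echo "default buffer: ${default_rate} MB/s"
echo "1MiB buffer:    ${large_rate} MB/s"
```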

> The ability to specify an explicit (and larger) buffer size has improved the
> performance by a factor of 5 on my test system, which is quite a noticeable
> gain, especially when dealing with files at least 50GB in size.

You will definitely not get any gain for large files, which have to come
from disk rather than cache, as the default buffer size for cat is more
than enough to saturate any current disk.

If you do want to control the buffer size, then the dd command already
allows this,
i.e. dd bs=1M if=my_data | data_cruncher
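
A self-contained sketch of that pipeline follows. `wc -c` stands in for
data_cruncher and the 3MiB input file is generated on the spot, both of
which are assumptions for illustration. The block size is written out as
1048576, which GNU dd also accepts as 1M:

```shell
#!/bin/sh
# Throwaway 3 MiB input standing in for my_data.
my_data=$(mktemp)
head -c 3145728 /dev/zero > "$my_data"

# dd reads and writes in 1 MiB blocks; its transfer statistics go to
# stderr, so they are discarded here to keep the pipe clean.
bytes_seen=$(dd bs=1048576 if="$my_data" 2>/dev/null | wc -c)

echo "consumer received $bytes_seen bytes"
rm -f "$my_data"
```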

Note I previously proposed a patch to dd to support a streaming option,
which would also help in your case. That patch hinted to the operating
system to read larger chunks of the file from disk and, more importantly,
not to put any of the data into the cache, as you usually don't want to
evict the current contents of the cache.
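
For illustration only: later GNU dd releases document an `iflag=nocache`
flag that asks the kernel to discard cached data for the input file.
Whether it matches the patch mentioned above is not something this
message establishes, so treat the following as a sketch under that
assumption. It falls back to a plain dd if the installed dd rejects the
flag, and `wc -c` again stands in for data_cruncher:

```shell
#!/bin/sh
# Throwaway 4 MiB input standing in for my_data.
my_data=$(mktemp)
head -c 4194304 /dev/zero > "$my_data"

# Probe: use iflag=nocache only if this dd accepts it
# (assumption: a GNU dd new enough to know the flag).
if dd if="$my_data" of=/dev/null bs=1048576 iflag=nocache 2>/dev/null; then
    nocache_flag="iflag=nocache"
else
    nocache_flag=""
fi

# $nocache_flag is deliberately unquoted so an empty value adds no argument.
streamed_bytes=$(dd if="$my_data" bs=1048576 $nocache_flag 2>/dev/null | wc -c)
expected_bytes=$(wc -c < "$my_data")

echo "streamed $streamed_bytes of $expected_bytes bytes"
rm -f "$my_data"
```

The data reaching the consumer is identical either way; the flag only
changes what the kernel is asked to keep in its cache.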

cheers,
Pádraig.



