From: Bob Proulx
Subject: Re: cut -b on huge files
Date: Wed, 8 Oct 2008 14:09:34 -0600
User-agent: Mutt/1.5.13 (2006-08-11)

Klein, Roger wrote:
> I am using cut in an awkward situation: I have huge files that for
> some reason show larger file sizes than they actually have.

Those files are probably sparse files.  Sparse files can be created
by using lseek(2) to seek to a different part of the file and then
writing data.  The result is a file with data at different locations
and a gap between them.  The filesystem can take advantage of this by
not allocating any disk blocks for the gap (a "hole" in the file), so
the file consumes fewer disk blocks than if the gap had been written
out as zeros.

For example you can use dd to create a sparse file:

  dd bs=1 seek=1G if=/dev/null of=big

That file will have an apparent size of 1 GiB but will consume almost
no actual disk space.
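
You can confirm the effect right away (the exact disk usage will vary
a little by filesystem):

  ls -l big    # apparent size: 1073741824 bytes
  du -k big    # disk usage: close to zero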

> 'du' reports the correct sizes b.t.w.:
> # du -k boot_image.clone2fs
> 56740   boot_image.clone2fs

'du' reports the disk usage of the file.  For a sparse file this can
be much smaller than the file's size.

Try using the --apparent-size option.

  du -k --apparent-size boot_image.clone2fs
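
On the sparse file from the dd example above the two forms would be
expected to differ dramatically (the numbers here are what I would
expect on a typical filesystem):

  du -k big                   # allocated disk space, e.g. 0
  du -k --apparent-size big   # size by byte count: 1048576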

> Now I found a hint on the Web
> (http://www.programmersheaven.com/mb/linux/187697/245244/re-how-to-change-filesize-in-linux/?S=B20000)
> for how to change the incorrect file size by using cut to carry over
> only a given number of bytes into a new file: cut -b 1-500 oldFile > newFile

Of course that will read every byte and write every byte, and the
result will no longer be sparse (assuming the input file was sparse
to begin with).  I don't think truncating the file is really what you
want to be doing here.  If you really want to flatten the file then
simply copying it would seem to be better:

  cp --sparse=never file1 file2

or

  cat file1 > file2
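
Afterward a quick sanity check should show that the copy is no longer
sparse but is still byte-for-byte identical:

  du -k file1 file2    # file2 should now consume its full size
  cmp file1 file2      # and the contents should still match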

> I never tried it on short files, but when I use this on the above file I
> get a very different result than expected:
> # cut -b 1-58101760 boot_image.clone2fs > boot_image.clone2fs_correct

Won't you need the bytes at the end of the file that you are
removing?  Truncating discards everything past that point, and I
expect that you will be needing those bytes at some point.

> # stat boot_image.clone2fs_correct
>   File: `boot_image.clone2fs_correct'
>   Size: 309987280       Blocks: 606048     IO Block: 4096   regular file

For what it is worth, those numbers don't seem right to me either.
If the original stat shows 1077411840 bytes then that is the size I
would hope to see in any copy.  Note also that cut works on lines,
not on the file as a whole: 'cut -b 1-N' keeps the first N bytes of
each input line, so on a file with many lines (or on binary data
containing newline bytes) the output size bears no simple relation
to N.
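
If the goal really were the first N bytes of the file as one stream,
head -c selects bytes from the whole input rather than from each
line.  A sketch with the byte count from your command (the output
file name here is just illustrative):

  head -c 58101760 boot_image.clone2fs > boot_image.first-bytes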

> The number of blocks and the apparent size are anything but correct now.

Try comparing the two files.

  cmp boot_image.clone2fs boot_image.clone2fs_correct

If they don't compare equal then I believe that you have corrupted
the file.

> To me this looks like a typical overflow problem. Could you please
> investigate this?

I think your problem is understanding the difference between the
file size and the disk space consumed to hold it.  These report the
file size, the count of bytes:

  du --apparent-size
  ls -l
  stat  (the Size field)
  wc -c
  ...most normal commands...

Versus these, which report the disk space actually allocated:

  du
  stat  (the Blocks field)
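
With GNU stat you can see both numbers at once.  A quick sketch using
stat(1) format sequences (%s is the byte size, %b the number of
allocated blocks, %B the size in bytes of each reported block):

  stat -c '%n: %s bytes, %b blocks of %B bytes' boot_image.clone2fs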

Try this experiment:

  rm -f big big2                        # start clean
  dd bs=1 seek=1M if=/dev/null of=big   # create a 1 MiB sparse file
  cat big > big2                        # copy it, flattening the holes
  wc -c big big2                        # same byte count for both
  cmp big big2                          # same contents
  ls -log big big2                      # same listed size
  du big big2                           # but (usually) very different disk usage
  du --apparent-size big big2           # and the same size once again
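
If things go as I would expect, every command there except the plain
'du' reports the two files as identical; 'du' alone should show 'big'
occupying almost no disk blocks while 'big2' occupies the full
megabyte (the exact numbers will vary by filesystem).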

Hope this helps,
Bob



