Re: [Bug-xorriso] generating reproducible ISOs with xorriso
From: Daniel Kahn Gillmor
Subject: Re: [Bug-xorriso] generating reproducible ISOs with xorriso
Date: Fri, 05 Jun 2015 12:36:40 -0400
User-agent: Notmuch/0.20.1 (http://notmuchmail.org) Emacs/24.4.1 (x86_64-pc-linux-gnu)
Hi Thomas--
Thanks for all your work looking into this!
On Fri 2015-06-05 10:57:38 -0400, Thomas Schmitt wrote:
> About the --sort-weight-list approach which is possible with
> already released xorriso versions:
>
>> (find . -type f -print0 | xargs -0 md5sum | sort | cut -f2- -d/ ; find .
>> -mindepth 1 \! -type f | sort | cut -f2- -d/ ) | awk '{ N=N+1; print N " "
>> $0 }'
>
> I misunderstood the role of md5sum here. Actually it seems
> surplus. Why not just sort the paths ? That would be enough to
> give awk a reproducible input sequence.
Right, but sorting by path alone would seem to fail for hardlinked or
deduped files, because the paths that share one content would end up
with unrelated weights instead of adjacent ones.
> Ok. The risk of a random collision is avoided and 2 billion
> files is not a severe limitation. (But the hardlinks ...)
>
> xorriso will not understand the "\n" which md5sum substitutes
> for newline characters in filenames. So trying to process such
> filenames will not be reliably reproducible and throw errors:
> xorriso : FAILURE : Cannot find path 'a\nb' in loaded ISO image
> One would have to set before -as mkisofs:
> -abort_on fatal
> in order to avoid a premature end of the program run.
> The attribution of weights would stop in any case.
Yep, i understand this limitation. For this first-pass hackery, I think
i'm ok with the idea that reproducibility fails if you put a newline in
a filename. Presumably the same goes for files that have a literal
backslash (\) in their name as well, since md5sum has to escape those
too. (sane people shouldn't be putting newlines and backslashes in
filenames anyway!)
> There is no need to attribute weight to directories.
> It applies only to the content source objects of regular files.
> ("Regular file" in the ISO, not necessarily on hard disk).
Thanks, that's useful to know, and it makes the command cleaner.
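For reference, a minimal sketch of what the cleaner, files-only variant might look like. The function name weight_list is made up for illustration, LC_ALL=C is my addition so that sort order does not depend on the locale, and xargs -r is a GNU extension. It still assumes no newlines or backslashes in filenames, as discussed above.

```shell
# Sketch: emit "<weight> <path>" lines for regular files only.
# Sorting on the md5sum output orders by digest first, then by path,
# so files with identical content receive adjacent weights.
# weight_list is a hypothetical helper name, not part of xorriso.
weight_list() {
    (
        cd "$1" || exit 1
        find . -type f -print0 |
            xargs -0 -r md5sum |
            LC_ALL=C sort |
            cut -f2- -d/ |
            awk '{ N = N + 1; print N " " $0 }'
    )
}
```

Something like `weight_list /path/to/tree > weights.txt` would then produce the list to hand over to --sort-weight-list.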
> So how about this:
>
> if test $(find . -name '*'$'\n''*' | wc -w) -gt 0
> then
> echo "FOUND FILENAMES WITH NEWLINES UNDERNEATH $(pwd)" >&2
> exit 1
> fi
>
> find . -type f -print | \
> sort | cut -f2- -d/ | awk '{ N=N+1; print N " " $0 }'
As i mentioned above, i don't like that sorting just by name seems to
miss out on the dedup/hardlink compression. I want reproducible images
*and* compact images (and a pony! :) )
Also, the above doesn't do anything for non-directory, non-regular files
(sockets, fifos, device nodes, etc) -- do those even make sense in
ISO-9660? Do we need to worry about how/where they sort?
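To find out whether a given tree has any such entries at all, something like the following sketch should do. The name list_special is made up for illustration, and symlinks are excluded on the assumption that they are not what the question is about. It only lists the candidates; it says nothing about how xorriso treats them.

```shell
# Sketch: list entries that are neither regular files, directories,
# nor symlinks (i.e. sockets, fifos, device nodes).
# list_special is a hypothetical helper name.
list_special() {
    find "$1" -mindepth 1 \! -type f \! -type d \! -type l -print
}
```

An empty result means the question does not arise for that tree.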
> Extent location of regular files:
>
> The question was:
> If i sort the hardlink-merged IsoFileSrc according to
> a ISO 9660 directory tree traversal, will the sequence be
> reproducible for trees with identical file names and
> attributes ?
>
> I now verified that the directories get sorted according
> to their ISO 9660 names. The process of name collision
> resolution (mangling) is complicated but depends only on
> the user defined input names and their sequence. Name sorting
> happens before mangling and afterwards.
> (libisofs/ecma119_tree.c functions ecma119_tree_create(),
> sort_tree(), mangle_tree(), qsort(3) in mangle_single_dir())
> So there should be no permutations of identical name lists
> possible.
This is a triply-nice result, esp. because

 * it includes the hardlink-merged files, and
 * it puts the extents in an order that seems intelligible from a scan
   of the dirtree, and
 * it piggybacks off of sorting work that's already being done, so
   doesn't seem to introduce much extra overhead.
> Extent location of directories:
>
> Looks already reproducible.
> They get stored after volume descriptors but not before block 32.
> (The extent address of the root directory can be read as little
> endian 32 bit number from byte 32924 to 32927 of the ISO.
> ECMA-119 8.4.18 and 9.1.3)
> The production of extents traverses the sorted ISO tree.
> (libisofs/ecma119.c function write_dirs())
> The size of a directory extent depends on name lengths and
> attributes of the files inside the directory.
>
> Then there are the Path Tables (nobody reads them):
>
> Looks already reproducible.
> The sequence of entries is determined by an array pathlist[]
> which gets filled by traversal of the sorted ISO tree.
> (libisofs/ecma119.c function write_path_tables())
Nice, these are both good news.
> So i will go for the reproducible array of IsoFileSrc in
> libisofs/filesrc.c function filesrc_writer_pre_compute().
> The red-black tree shall merge hardlinks but not define
> the sequence of data file extents.
One other possible approach occurred to me yesterday:
What if you kept the red/black tree implementation, but keyed it by file
content digest (md5, sha1, sha256, whatever) instead of by dev/inode
tuple? Using such a red-black tree for extent placement would give you
not only hardlink discovery and reproducibility, but also automatic
deduplication for even more compact images in some situations.
Advantages:

 * even more compact images in some cases
 * images with renamed files or added or removed hardlinks would only
   vary by directory entry, not by file content placement

Downsides:

 * It would certainly be more compute cycles than the existing approach
   (or the dirtree traversal ordering you describe above)
 * placement of the extents in a single image is less
   comprehensible/obvious than the dirtree traversal ordering
 * Maybe there is some context where deduplicated files could be
   dangerous?
So maybe it's not worth doing, i just wanted to describe the possibility
(you probably saw it already). I leave it in your hands :)
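To get a feeling for how much a digest key could save on a particular tree, here is a rough shell sketch that groups regular files by sha256. It is only an illustration from the outside, not a proposal for how libisofs should do it; dup_groups is a made-up name, and the same caveat about newlines and backslashes in filenames applies.

```shell
# Sketch: print groups of regular files whose sha256 digests match.
# Each printed group could in principle share one data extent.
# dup_groups is a hypothetical helper name.
dup_groups() {
    (
        cd "$1" || exit 1
        find . -type f -print0 |
            xargs -0 -r sha256sum |
            LC_ALL=C sort |
            awk '
                {
                    digest = substr($0, 1, 64)   # 64 hex digits
                    path = substr($0, 67)        # skip the 2 separator chars
                    if (digest == prev) {
                        group = group "\n  " path
                        n++
                    } else {
                        if (n > 1) print group   # flush previous group
                        group = digest "\n  " path
                        n = 1
                        prev = digest
                    }
                }
                END { if (n > 1) print group }
            '
    )
}
```

Files with a unique digest are not printed at all, so an empty result means the digest key would gain nothing beyond what the dev/inode key already finds for hardlinks.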
> This can last a few days. I will give a note to the lists
> when the GNU xorriso-1.4.1 development tarball is worth a
> test.
That's great to hear. Thank you, Thomas!
Regards,
--dkg