Hello.
I'd like to ask the split maintainers whether a patch implementing
content-defined chunking (CDC) for split would be of interest.
My use case is a simple one: I need a content-defined chunker to store
100'000 versions of a ~500 MiB text file in a git repository and take
advantage of the excellent xdelta implementation in the git toolkit.
To keep xdelta efficient, every version needs to be split at the very
same places. I've looked around and found a few projects that do this,
but most of them have been unmaintained for a while.
I'm thinking of implementing a patch for split(1) that extends the
current CLI interface as follows:
--hash-seed '... ' - defines the seed for the rolling hash
--bytes h/i/N[W] - split where hash(window) % N == i, producing chunks
of N bytes (SIZE=N) on average
--line-bytes h/i/N[W] - the same, but preserving line boundaries
If `i` is omitted (e.g. h/N[W]), it defaults to 0.
`W` is the width of the rolling hash window; if it is omitted (e.g.
h/N), it defaults to 0xFFF, following the default value in the Borg
backup tool.
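To make the syntax concrete, an invocation could look like this
(purely illustrative, since the options above do not exist yet; the
seed string, input file and output prefix are arbitrary):

    # Content-defined split with an average chunk size of ~1 MiB
    # (N = 1048576); i defaults to 0 and W defaults to 0xFFF.
    split --hash-seed 'some-seed' --bytes h/1048576 bigfile.txt chunk-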
The CDC algorithm I have in mind is a BUZhash-based one, as it allows
an arbitrary window width. SipHash might additionally be worth using
for --line-bytes, where only line boundaries are candidate cut points,
if it shows a measurable performance gain over BUZhash in that case.
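For concreteness, below is a rough sketch of the per-byte BUZhash
update and the cut test I have in mind (assuming a 32-bit hash; the
names are placeholders and the 256-entry table would be derived from
--hash-seed, so this is not the actual patch):

    /* Illustrative sketch only, not the proposed implementation.  */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    static inline uint32_t
    rotl32 (uint32_t x, unsigned int r)
    {
      r &= 31;
      return r ? (x << r) | (x >> (32 - r)) : x;
    }

    struct buzhash
    {
      uint32_t h;          /* current hash of the window */
      size_t w;            /* window width in bytes (W) */
      const uint32_t *t;   /* 256-entry table seeded from --hash-seed */
    };

    /* Prime the hash with one of the first W bytes of the input.  */
    static inline void
    buzhash_feed (struct buzhash *b, unsigned char in)
    {
      b->h = rotl32 (b->h, 1) ^ b->t[in];
    }

    /* Slide the window by one byte: OUT leaves, IN enters.  */
    static inline void
    buzhash_roll (struct buzhash *b, unsigned char out, unsigned char in)
    {
      b->h = rotl32 (b->h, 1) ^ rotl32 (b->t[out], b->w & 31) ^ b->t[in];
    }

    /* Proposed cut condition: split after the current byte when
       hash (window) % N == i.  */
    static inline bool
    is_cut_point (const struct buzhash *b, uint32_t n, uint32_t i)
    {
      return b->h % n == i;
    }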
Would such a patch be considered for inclusion in split(1), or is it
something that is not generic enough for coreutils?