Re: RFC: split(1) and content-defined chunking, e.g. --bytes h/N & --lin

coreutils

From:	Pádraig Brady
Subject:	Re: RFC: split(1) and content-defined chunking, e.g. --bytes h/N & --line-bytes=h/N
Date:	Wed, 15 Jan 2025 12:20:43 +0000
User-agent:	Mozilla Thunderbird Beta

On 15/01/2025 11:56, Leonid Evdokimov wrote:

Hello.

I'd like to ask the split maintainers whether a patch implementing
content-defined chunking for split would be of interest.

My use case is a simple one: I need a content-defined chunker to put
100'000 versions of a ~500 MiB text file in a git repository and make
use of the excellent xdelta implementation in the git toolkit. To keep
xdelta efficient, I'd like every version to be split at the very same
places.

I've looked around and found a few projects doing that, but most
of them have been unmaintained for a while.

I'm thinking of implementing a patch for split(1) that extends the
current CLI in the following way:

--hash-seed '... ' - defines the seed for the rolling hash
--bytes h/i/N[W] - split if hash(window) % N = i, producing chunks of
SIZE=N on average
--line-bytes h/i/N[W] - similar, but preserving line boundaries

If `i` is omitted (e.g. h/N[W]), it defaults to 0.
`W` is the width of the rolling hash window; if it's omitted
(e.g. h/N), it defaults to 0xFFF, following the default value in the
Borg backup tool.
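
(A rough illustration of the "SIZE=N on average" claim above, not part of
the proposed patch. It assumes the rolling hash behaves like a uniformly
random value at every position, so each position is a cut with probability
about 1/N. The function name and parameters are hypothetical.)

```python
import random

def mean_chunk_size(n, i=0, positions=500_000, seed=1):
    """Estimate the average chunk size for the rule hash % n == i."""
    rng = random.Random(seed)
    cuts = 0
    for _ in range(positions):
        # Stand-in for hash(window): a uniformly random 32-bit value.
        if rng.getrandbits(32) % n == i:
            cuts += 1
    # Average spacing between cut points.
    return positions / cuts if cuts else float("inf")
```

Under that assumption, mean_chunk_size(100) should come out close to 100,
and the choice of `i` should not change the average.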

The CDC algorithm I have in mind is BUZhash-based, as it allows the
use of an arbitrary window size. And, maybe, SipHash for --line-bytes,
if it has a measurable performance gain over BUZhash when the number
of candidate cut points is limited.
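
(A minimal sketch of BUZhash-based chunking with the cut rule
hash(window) % N = i, to make the idea concrete. It is not the proposed
split(1) implementation: the 32-bit hash width, the seeded random table,
and the names cdc_split, make_table and rotl32 are all assumptions made
for illustration. It enforces no minimum or maximum chunk size.)

```python
import random

MASK32 = 0xFFFFFFFF

def rotl32(x, r):
    """Rotate a 32-bit value left by r bits (r taken modulo 32)."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK32

def make_table(seed):
    """Build the 256-entry substitution table from a seed (assumed scheme)."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(256)]

def cdc_split(data, n, i=0, window=0xFFF, seed=0):
    """Yield chunks of `data`, cutting after each position where the
    BUZhash of the last `window` bytes satisfies hash % n == i."""
    table = make_table(seed)
    h = 0
    start = 0
    for pos, byte in enumerate(data):
        # Roll the hash forward by one byte.
        h = rotl32(h, 1) ^ table[byte]
        if pos >= window:
            # Drop the byte that just left the window.
            h ^= rotl32(table[data[pos - window]], window)
        # Only consider cuts once a full window has been seen.
        if pos + 1 >= window and h % n == i:
            yield data[start:pos + 1]
            start = pos + 1
    if start < len(data):
        yield data[start:]
```

Because the hash depends only on the last `window` bytes, inserting a few
bytes near the start of the input shifts the later cut points by the same
amount, so the later chunks come out byte-for-byte identical.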

Would a patch like this be considered for inclusion in split(1), or is
it not generic enough for coreutils?


Interesting. I found this a good summary of Content Defined Chunking:
https://joshleeb.com/posts/content-defined-chunking.html

This might indeed be general enough for coreutils.

Might coreutils csplit be a better place for this,
given the split is dependent on the content?

cheers,
Pádraig
