Re: RFC: split(1) and content-defined chunking, e.g. --bytes h/N & --lin

coreutils

From:	Leonid Evdokimov
Subject:	Re: RFC: split(1) and content-defined chunking, e.g. --bytes h/N & --line-bytes=h/N
Date:	Wed, 15 Jan 2025 15:48:36 +0300

On Wed, Jan 15, 2025 at 3:20 PM Pádraig Brady <P@draigbrady.com> wrote:
> Might coreutils csplit be a better place for this,
> given the split is dependent on the content?

I can argue for both sides :-)

I've picked split over csplit because CDC treats its input as a stream
of bytes and targets a specific output size, like split does. The
largest semantic unit split knows about is a single-byte record
delimiter. csplit is more line-oriented and regexp-oriented, so it
works less at the byte level.

That said, I fully agree that the goals of split and csplit overlap to
some extent, and that there may be demand for hash-based behavior in
csplit as well. However, I'd rather consider adding code to split to
support patterns like these:

$ split ... --separator ',\n' ... # multi-byte string, usual JSONL separator

$ split ... --separator '</subdoc>' ... # another somewhat common multi-byte string

$ split ... --separator <(grep --byte-offset ...) ... # grep is good at regexps :-)

So the power of grep might be used to specify _potential_ cut points
and some CDC hash might be used to pick a subset of those cuts.
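
A minimal sketch of that two-stage idea, not the proposed split
implementation: candidate offsets are taken right after each occurrence
of a multi-byte separator, and a hash of the bytes just before each
candidate decides whether the cut is kept. The function names, the
16-byte window, the use of BLAKE2b and the "hash % divisor == 0" rule
are all assumptions made for illustration.

```python
import hashlib


def candidate_cuts(data: bytes, separator: bytes) -> list[int]:
    """Offsets just past each non-overlapping occurrence of `separator`."""
    cuts = []
    pos = data.find(separator)
    while pos != -1:
        cuts.append(pos + len(separator))
        pos = data.find(separator, pos + len(separator))
    return cuts


def pick_cuts(data: bytes, candidates: list[int], divisor: int,
              window: int = 16) -> list[int]:
    """Keep candidates whose preceding `window` bytes hash to 0 mod `divisor`.

    The decision depends only on local content, so inserting data far
    away from a cut does not change whether that cut is selected.
    """
    chosen = []
    for offset in candidates:
        context = data[max(0, offset - window):offset]
        digest = hashlib.blake2b(context, digest_size=8).digest()
        if int.from_bytes(digest, "big") % divisor == 0:
            chosen.append(offset)
    return chosen


if __name__ == "__main__":
    # Hypothetical input: 200 records joined by the ',\n' separator.
    sample = b"".join(b"record-%04d-payload,\n" % i for i in range(200))
    candidates = candidate_cuts(sample, b",\n")
    kept = pick_cuts(sample, candidates, divisor=8)
    print(len(candidates), "candidates,", len(kept), "kept")
```

With a well-mixed hash, roughly one candidate in `divisor` survives, so
the average chunk spans about `divisor` records, while every cut still
falls on a separator boundary.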

-- 
WBRBW, Leonid Evdokimov, https://darkk.net.ru tel:+79816800702
PGP: 6691 DE6B 4CCD C1C1 76A0  0D4A E1F2 A980 7F50 FAB2
