Re: RFC: split(1) and content-defined chunking, e.g. --bytes h/N & --lin
From: Leonid Evdokimov
Subject: Re: RFC: split(1) and content-defined chunking, e.g. --bytes h/N & --line-bytes=h/N
Date: Wed, 15 Jan 2025 15:48:36 +0300
On Wed, Jan 15, 2025 at 3:20 PM Pádraig Brady <P@draigbrady.com> wrote:
> Might coreutils csplit be a better place for this,
> given the split is dependent on the content?
I can argue for both sides :-)
I've picked split over csplit because CDC treats its input as a stream of
bytes and targets a specific output size, as split does. The largest semantic
unit in split is a single-byte record delimiter. csplit is more line-
and regexp-oriented, so it does less byte-level processing.
Meanwhile, I totally agree that there is a certain overlap between the
goals of split and csplit, and there might be some desire to have
hash-based behavior in csplit as well. However, I'd rather consider
adding code to support patterns like these:
$ split ... --separator ',\n' ... # multi-byte string, usual JSONL separator
$ split ... --separator '</subdoc>' ... # another somewhat common multi-byte string
$ split ... --separator <(grep --byte-offset ...) ... # grep is good at regexps :-)
So the power of grep might be used to specify _potential_ cut points
and some CDC hash might be used to pick a subset of those cuts.
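To make the "candidates first, hash second" idea concrete, here is a rough Python sketch. It is not the proposed split implementation: the helper names (candidate_offsets, pick_cuts, split_at), the 16-byte window, and the use of CRC32 as the hash are all assumptions made for illustration. A candidate offset survives only if the hash of the bytes just before it is divisible by N, so the choice depends on local content rather than on absolute position.

```python
import zlib


def candidate_offsets(data: bytes, sep: bytes) -> list[int]:
    """Return the offsets just past each non-overlapping occurrence of sep."""
    offsets = []
    pos = data.find(sep)
    while pos != -1:
        offsets.append(pos + len(sep))
        pos = data.find(sep, pos + len(sep))
    return offsets


def pick_cuts(data: bytes, candidates, divisor: int, window: int = 16) -> list[int]:
    """Keep candidates whose preceding window hashes to 0 modulo divisor.

    CRC32 is only a stand-in hash here (an assumption, not the RFC's choice).
    Offsets at the very start or end of the data are ignored.
    """
    cuts = []
    for off in sorted(candidates):
        if not 0 < off < len(data):
            continue
        digest = zlib.crc32(data[max(0, off - window):off])
        if digest % divisor == 0:
            cuts.append(off)
    return cuts


def split_at(data: bytes, cuts) -> list[bytes]:
    """Split data at the given sorted offsets."""
    bounds = [0, *cuts, len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


# Usage: records separated by the multi-byte string ',\n'.
records = b"".join(b'{"id":%d},\n' % i for i in range(1000))
potential = candidate_offsets(records, b",\n")
chunks = split_at(records, pick_cuts(records, potential, divisor=64))
print(len(potential), "candidates ->", len(chunks), "chunks")
```

With a reasonably uniform hash, about one candidate in N is kept, so N sets the average chunk size in records. Since each decision only looks at the window before the candidate, inserting data early in the stream shifts later cut points but does not change which records end a chunk.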
--
WBRBW, Leonid Evdokimov, https://darkk.net.ru tel:+79816800702
PGP: 6691 DE6B 4CCD C1C1 76A0 0D4A E1F2 A980 7F50 FAB2