coreutils

Re: Patchset pertaining to --si option of df, du, ls


From: Glenn Golden
Subject: Re: Patchset pertaining to --si option of df, du, ls
Date: Fri, 11 Sep 2020 08:23:54 -0600
User-agent: Mutt/1.10.1 (2018-07-13)

On 09 Sep 2020 17:46:36 -0700, L A Walsh wrote:
>
>
> On 9/8/2020 9:27 AM, Glenn Golden wrote:
> > The attached patchset addresses a minor issue with program behavior vs.
> > documentation of the df, du, and ls tools from coreutils-8.32, when using
> > the --si option.
> > 
> > It resurrects an issue that was brought up in 2014 [3] and eventually
> > closed in 2018 [4] with a wontfix (after minimal discussion in the
> > intervening time).
> > 
> > 
> > Summary
> > -------
> > 
> > Output from df, du, ls tools with the --si option display results using
> > single-letter units suffixes "k", "M", "G", etc., rather than "kB", "MB",
> > "GB".
> >
> 
> With or without --si?
> 

The unpatched code uses "k", "M", "G", etc. when the SI option is used.  The
patched code uses "kB", "MB", "GB" when the SI option is used. This accords
with coreutils.info Section 2.3, and is also self-consistent with all other
usage of the suffixes "kB", "MB", etc., to indicate decimal base.

>
> If you want to change 'si' output, I'm not against that, however I am very
> much against changing default or -h format.
> 

The proposed patch has no (intended) effect on any behavior other than the
suffixes emitted when the SI option (in any of its various forms) is used.

>
> I don't need to know 'B' as a suffix when talking about disks.
> Disks under an OS report space in base-2 multiples of bits (2**3 bits=1B).
> 

Fine, but within the context of coreutils, the "B" appended to a single-letter
suffix doesn't really mean "bytes".  It's used in an ersatz way to indicate
that the computation base for the associated numerical value is decimal.
According to coreutils.info Section 2.3, a bare "M" implies binary base,
and "MB" implies decimal base.

So that entire sub-issue -- whether the B "ought" or "ought not" be present in
the blocksize indicator suffix -- is not limited only to the SI option; there
is an entire family of blocksize options, selectable using --block-size=XXX,
that indicate which base (decimal or binary) is to be used to compute the
numerical value, and which indicator suffixes (M, MB, MiB) are to be appended
for each case.

With the patched code, the SI option would simply be one among several options
that append "B" to single-letter suffixes.  Other blocksize options, besides
SI, already append B (e.g. --block-size=MB) with the semantic that "MB" means
1000^2.  So the patch is not introducing the use of "B" within coreutils. 
It simply makes the semantic of the appended "B" globally self-consistent
among those tools in all cases when it _is_ used.

To put it more directly: the proposed patch makes the behavior of {df, du, ls}
fully consistent with respect to the relationship between computation base and
the associated indicator suffixes.  With the patched code, the following are
always true, with no exceptions whatsoever:

    bare "M"  always means 1024^2, no exceptions

    "MiB"     always means 1024^2, no exceptions

    "MB"      always means 1000^2, no exceptions

The above behavior is what coreutils.info Section 2.3 specifies, and it is
also consistent with the "coreutils gotchas" page (ref [2] from the original
post), which says:

    "In general the units representations in coreutils are unfortunate,
     but an accident of history. POSIX specifies 'k' and 'b' to mean 1024
     and 512 respectively. Standards wise 'k' should really mean 1000
     and 'K' 1024. Then extending from that we now have (which we can't
     change for compatibility reasons):

            k=K=kiB=KiB=1024
            kb=KB=1000
            M=MiB=1024^2
            MB=1000^2"
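
As an illustration only (not part of the patchset), the suffix rules above can
be sketched as a small POSIX shell function.  The name suffix_multiplier is
hypothetical, and the G/GiB/GB entries are an assumed extension of the same
pattern rather than something listed in the quoted text:

```shell
# Hypothetical helper: map a block-size suffix to its multiplier, following
# the rules stated above (bare letter and "iB" forms are binary, "B" forms
# are decimal).  The G entries are an assumed extension of the M pattern.
suffix_multiplier() {
    case $1 in
        k|K|kiB|KiB) echo $((1024)) ;;
        kB|KB)       echo $((1000)) ;;
        M|MiB)       echo $((1024 * 1024)) ;;
        MB)          echo $((1000 * 1000)) ;;
        G|GiB)       echo $((1024 * 1024 * 1024)) ;;
        GB)          echo $((1000 * 1000 * 1000)) ;;
        *)           echo "unsupported suffix: $1" >&2; return 1 ;;
    esac
}
```

With these rules, "suffix_multiplier M" and "suffix_multiplier MiB" both print
1048576, while "suffix_multiplier MB" prints 1000000.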

With the unpatched 8.32 code, the above nice consistency is voided by the use
of the SI option:

    bare "M"  _sometimes_ means 1024^2 and _sometimes_ means 1000^2:
                * When the SI option is used, M means 1000^2
                * When the SI option is not used, M means 1024^2
    "MiB"     always means 1024^2, no exceptions
    "MB"      always means 1000^2, no exceptions

Furthermore, the "bare M" semantic inconsistency can be subtle, and is not
always discernible simply by inspecting the command line, because the SI option
can be invoked via environment variables. (As an aside, this is how I got
burned by it, motivating the patch.  Imo, this is a nasty inconsistency.)
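
To make the environment-variable point concrete, here is a rough sketch of how
a script might check whether a block-size variable is in effect before trusting
a bare "M".  The function name effective_block_size_var is hypothetical.  The
variable names (DF_BLOCK_SIZE, DU_BLOCK_SIZE, LS_BLOCK_SIZE, BLOCK_SIZE,
BLOCKSIZE) are the ones documented in coreutils.info Section 2.3, but the
lookup below is a simplification: it ignores POSIXLY_CORRECT and any
tool-specific details.

```shell
# Hypothetical helper: print the first block-size environment variable that
# is set for the given tool (df, du, or ls), as NAME=value.
# Simplified sketch; POSIXLY_CORRECT and per-tool quirks are not modeled.
effective_block_size_var() {
    case $1 in
        df) specific=DF_BLOCK_SIZE ;;
        du) specific=DU_BLOCK_SIZE ;;
        ls) specific=LS_BLOCK_SIZE ;;
        *)  echo "unknown tool: $1" >&2; return 2 ;;
    esac
    for var in "$specific" BLOCK_SIZE BLOCKSIZE; do
        # Indirect lookup: is the variable named by $var set at all?
        eval "is_set=\${$var+set}"
        if [ "$is_set" = set ]; then
            eval "val=\${$var}"
            printf '%s=%s\n' "$var" "$val"
            return 0
        fi
    done
    return 1    # nothing set; the tool's built-in default applies
}
```

For example, with BLOCK_SIZE=si set and no tool-specific variable, calling
"effective_block_size_var df" prints BLOCK_SIZE=si, which is exactly the case
where a bare "M" in unpatched output means 1000^2 even though no SI option
appears on the command line.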

>
> In memory and disk utils used in operating systems, the assumed unit is
> the base-2 unit, the Byte and base-2 multiples thereof.  You cannot and
> should not try to mix bases when reporting sizes -- if you use Bytes (as
> on a computer), then K,M,G,... are base-2 multiples of a base-2 unit.
> 

Yes, and that's how Section 2.3 describes the semantics of those bare suffixes.
But coreutils also supports, for historical reasons unrelated to the proposed
patch, the use of kB, MB, GB, etc. to indicate decimal base.

>
> If you are talking 'b'its, it seems that is the closest practical unit
> for measurement of information.  With a single unit, base10 kbit, gbit, mbit
> or kb,gb,mb seem fine.  There is no possibility of confusion with prefixes
> for fractional values as you can't have a milli-bit or such.
> 
> I find kB confusing, since it is using the lower case 'k' as used for
> km (kilometer) and shouldn't be used where 1024 is meant.
> 

The upper- vs. lower-case k issue is historical and outside the scope of the
proposed patch.  The patch leaves existing {k,K}-case behavior exactly as-is.
See comments from [2] (above) regarding the history of this wart.

>
> I.e. when measuring values of space -- standard hard disks require 512B.
> Various utils also use 1K as a disk space size and recent hard disks
> have a 4K sector size.
>

Fine, but coreutils.info Section 2.3 makes it explicit that the blocksizes
reported by {df, du, ls} are not to be interpreted as having any particular
relationship to filesystem (or implicitly, on-disk) block sizes:

    "The block size used for display is independent of any file system block
     size.  Fractional block counts are rounded up to the nearest integer."

>
> When reporting space, you can't allocate fractional sectors so they have to
> be a multiple of 512, 1K or 4K and --si should have no place among base-2
> machines regarding disk space.
>

OK, but again, the entire issue of whether coreutils "ought" or "ought not"
support options allowing the display of disk space in decimal base is outside
the scope of the proposed patch.

Whether one agrees with it or not, coreutils does presently support decimal
base, and that support is not limited to the SI option. The patch has no
effect on that support; its only effect is to strictly enforce consistent use
of the suffixes implying binary vs. decimal base, when decimal base output
format is requested by the user, e.g. via --block-size=MB or --block-size=si.

>
> Memory is the same.  It isn't allocated in numbers that are multiples of 10.
> Anyone using nomenclature suggesting such, is demonstrating how little they
> know  about computers -- and those who use computers should listen to them
> about how to describe space?
> 
> That said, since metric more famous usage for prefixes >1 has been
> 'k', I prefer lower case for metric to be consistent with long standing
> usage of 'km' = kilometer.  Larger values, _I_ feel should be consistent
> with metric's largest usage: How often do you see a sign showing anything
> in mega-meters or giga-meters.  You ever see anything measured in
> mega-liters (let alone giga or tera liters).
> 
> Metric has standard units where the prefixes apply to the singular unit.
> A Byte isn't a singular unit of information, a 'bit' is.  Therefore standard
> 's.i.' units shouldn't really be used with non-singular units (is there
> a counter example, like where one talks about mega-[some non unary unit],
> like a gross of eggs being 1.2 deka-dozen eggs (I think)?
> 
> The main problem is that base-10 isn't a good fit for a base-2 environment,
> though I would regularly accept base-10 prefixes with bits.
> 
> So can you reserve lower case for SI, since they use lowercase 'k' for
> 1000-m and leave Uppercase (with or without B) for base-2?
> 

I would resist doing that, because the effect in the wild would be even more
extensive than the proposed patch.

The proposed patch is very simple: It has no effect on whether "k" is displayed
in upper or lower case, and intentionally so, because changing the case as you
propose above ("reserve lower case for SI") would affect more than just scripts
that use the SI option. For example:

    $ df --block-size=KB /mnt/test              # Unpatched or patched code
    Filesystem     1kB-blocks   Used Available Use% Mounted on
    /dev/sda7       1879782kB 2888kB 1857459kB   1% /mnt/test

If the behavior were changed as you propose, then "kB" would change to "KB",
even though the SI option was not used.

It was an explicit goal of the proposed patch to have zero effect on existing
behavior _except_ when the SI option is used, in which case its only effect is
to substitute "MB" for bare "M", so as to enforce global consistency in the
semantics of those suffixes, among all the tools.

>
> It may not be what's authoritative, but it is what makes sense.
> 

I don't disagree with that, but given the extreme (and understandable)
sensitivity to the output-scraping issue, the patch as-is seems to be
about the best one can do without making that issue worse:  It affects only
the output produced when the SI option is explicitly specified by the caller,
and leaves everything else exactly as-is, warts [2] and all.
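
For scripts that scrape this kind of output, one defensive approach is to split
each size field into its number and suffix instead of assuming a particular
suffix.  A minimal sketch in POSIX shell follows; split_size is a hypothetical
helper, and the sample value is taken from the df output shown earlier:

```shell
# Hypothetical helper: split a size field such as "1879782kB" into its
# numeric part and its unit suffix, printed as "NUMBER SUFFIX".
split_size() {
    value=$1
    # Drop everything from the first character that is not a digit or dot.
    number=${value%%[!0-9.]*}
    # Whatever follows the numeric part is the suffix (possibly empty).
    suffix=${value#"$number"}
    printf '%s %s\n' "$number" "$suffix"
}

split_size 1879782kB    # prints "1879782 kB"
```

A caller can then decide explicitly how to interpret "k", "kB", "M", or "MB",
rather than silently depending on whichever suffix the tool happens to emit.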

>
> It would also allow scripts that use existing behavior in non-base10 to
> continue working.  Though might break scripts using --si.  But how many
> scripts would use that?
> 

Imo, very few, which is why I proposed the patch as it is: The patch brings
the suffix indicators into 100% self-consistency within coreutils, at the sole
cost of possibly breaking scripts which actually use the SI option.  And my
sense -- similar to yours (I'm assuming, from your phrasing above) -- is that
there are probably not many of those.

If that's true, then it seems like it may be a worthwhile tradeoff: Imposition
of self-consistent suffix semantics once and for all, vs. breaking a (probably)
small number of scripts in the wild that explicitly use the SI option.

Glenn Golden


