guix-patches
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug#51838] [PATCH v6 03/41] guix: node-build-system: Add JSON utilities


From: Philip McGrath
Subject: [bug#51838] [PATCH v6 03/41] guix: node-build-system: Add JSON utilities.
Date: Fri, 31 Dec 2021 00:22:48 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.3.1

Hi Liliana,

On 12/30/21 11:56, Liliana Marie Prikler wrote:
> Am Donnerstag, dem 30.12.2021 um 02:38 -0500 schrieb Philip McGrath:
>> This commit adds several utility functions for non-destructive
>> transformation of the JSON representation used by (guix build json),
>> particularly for purely functional update of JSON objects.  They
>> should
>> eventually be exported, but most are private for now to allow for
>> more
>> experience and consideration before commiting to the API.  The design
>> was largely inspired by the 'racket/dict' and 'racket/hash'
>> libraries.
>> Liliana Marie Prikler proposed 'with-atomic-json-file-replacement'.
> Given that this is a fair amount of procedures that you're proposing, I
> think a new file would be appropriate.  Perhaps (guix build json-
> utils)?  Adding that should IIUC not cause a world rebuild, so we could
> do that on master.
>

I agree that these functions ultimately belong in their own file, and I'd even had the name (guix build json-utils) in mind.

I put them in (guix build node-build-system) for now because, if they were in (guix build json-utils), they would have to be exported, at which point their API would have to be relatively stable, and I didn't want designing them to block, or to be rushed by, the rest of this series. Now, maybe consensus on the json-utils will turn out to be the easy part! But my high-level question is, in your view, do any of the points I'm about to respond to block this patch series?

On 12/30/21 13:18, Liliana Marie Prikler wrote:
Having argued for these procedures to be moved into their own file in a
separate mail, now it's time to bikeshed stylistic choices.

Am Donnerstag, dem 30.12.2021 um 02:38 -0500 schrieb Philip McGrath:
+(define (jsobject-ref js key failure-result)
+  "Return the value assosciated with KEY in the json object JS.  If
KEY is not
+found and FAILURE-RESULT is a procedure, it is called in tail position
with
+zero arguments.  Otherwise, FAILURE-RESULT is returned."
+  ;; TODO: `failure-result` should be optional, but should the default
+  ;; `failure-result` be #f (like `assoc-ref`), a thunk raising an
exception,
+  ;; '(@), or something else?  Keep it mandatory until we discuss and
decide.
+  (match js
+    (('@ . alist)
+     (match (assoc key alist)
+       (#f
+        (if (procedure? failure-result)
+            (failure-result)
+            failure-result))
+       ((_ . value)
+        value)))))
We can safely replace failure-result by Guile's DEFAULT and leave error
handling to the user.

I don't care whether we call it DEFAULT or FAILURE-RESULT.

I agree that it should not raise an exception by default. My current thinking is that '(@) would be a good default DEFAULT: it is useful for the common pattern of traversing or transforming nested objects, and, as you point out at the end of this email, explicitly typing #f (the other useful possibility) looks much more like normal Scheme than explicitly typing '(@).

In my experience with [1] and [2] (the purely-functional dictionary libraries I use most often), the special case for a procedure as DEFAULT is useful. I feel less strongly about it because it's relatively easy to work around for JSON, since you can pick some non-JSON signal value, but it also seems to have especially little downside for JSON, since (guix build json) will never have a procedure in a valid JSON object. In addition to raising exceptions and other control flow, it's useful for default values that are expensive to produce.

[1]: https://docs.racket-lang.org/reference/hashtables.html
[2]: https://docs.racket-lang.org/reference/dicts.html


+(define (alist-pop alist key)
+  "Return two values: the first pair in ALIST with the given KEY in
its
+'car' (or #f, if no such pair exists) and an assosciation list like
(and
+potentially sharing storage with) ALIST, but with no entry for KEY."
+  (match (assoc key alist)
+    ;; If key isn't present, we don't need to do any allocation
+    (#f
+     (values #f alist))
+    (found
+     (values found
+             ;; Because we have `found`, we can find it more
+             ;; efficiently this time with `eq?`. We avoid using
+             ;; `delq` because it would copy pairs in a shared
+             ;; tail. We assume a sufficiently smart compiler to
+             ;; handle "tail recursion modulo cons" (vid. e.g. Indiana
+             ;; University Technical Report No. 19, Friedman & Wise
+             ;; 1975) at least as efficiently as a hand-written
+             ;; tail-recursive implementation with an accumulator.
+             (let loop ((alist alist))
+               (match alist
+                 ;; We know that `found` is present,
+                 ;; so no need to check for '()
+                 ((this . alist)
+                  (if (eq? this found)
+                      alist
+                      (cons this (loop alist))))))))))
I think this can be more efficiently be done in a "single" loop.

   (let loop ((rest alist)
              (previous '()))
     (match rest
       (() (values #f alist))
       ((first . rest)
        (if (eq? (car first) key)
            (values first (reverse! previous rest))
            (loop rest (cons first previous))))))


I'll admit to a Racket bias, but, having just eliminated the use of 'assoc-set!', I'm loathe to start mutating pairs (even correctly). To quote a bit from the SRFI-1 spec for 'append-reverse!', "note that this pattern of iterative computation followed by a reverse can frequently be rewritten as a recursion, dispensing with the reverse and append-reverse steps, and shifting temporary, intermediate storage from the heap to the stack, which is typically a win for reasons of cache locality and eager storage reclamation." (See how 'set-cdr!' can crash safe Chez Scheme! <https://github.com/cisco/ChezScheme/issues/599>)

IIUC, using SRFI-1's 'span' would lead to the same situation.

Also, I don't think your version is tail-recursive.  (loop alist) is
not in tail position from what I can tell.

Yes, "tail recursion modulo cons" refers to a compiler optimization for functions which are _not_ tail recursive. For full details, see the Friedman & Wise 1975 tech report I cited at <https://legacy.cs.indiana.edu/ftp/techreports/TR19.pdf> (or various other articles), but, as briefly as I can: The optimization rests on the observation that many recursive functions, like the classic definition of 'map':

    (define (map f lst)
      (match lst
        (()
         '())
        ((this . lst)
         (cons (f this)
               (map f lst)))))

are nearly tail-recursive, and the only real work remaining to be done in the continuation of the recursive call is to fill in the cdr of the pair. Thus, a compiler can safely transform this code into a truly tail-recursive implementation:

    (define (map f lst)
      (match lst
        (()
         '())
        ((this . lst)
         (define ret (list (f this)))
         (let loop ((dest ret)
                    (lst lst))
           (match lst
             ((this . lst)
              (define new (list (f this)))
              (set-cdr! dest new)
              (loop new lst))
             (()
              ret))))))

Unlike the Proper Implementation of Tail Calls (so-called "tail-call optimization"), handling "tail recursion modulo cons" truly is an optimization: it does not change the space complexity of the function. But it can allow the compiler to generate whatever code it thinks will work best with its collector/allocator and continuation/"call stack" implementation.

(The optimizations applies to constructors in general, not just 'cons', and a compiler can safely apply it to values that are immutable from the perspective of the source language.)


+;; Sadly, Guile's implementation of (@ (srfi srfi-1) alist-delete)
+;; performs unnecessary allocation, e.g. this currently evaluates to
#f:
+;;
+;;     (let ((alist `(("a" . 1)("b" . 2)("c" . 3))))
+;;       (eq? alist (alist-delete "x" alist)))
+;;
+;; These functions generally choose to allocate a new outer pair
(with the '@
+;; tag), even though in unusual cases the resulting object might not
have
+;; changed, for the sake of simplicity and to avoid retaining a
reference to
+;; the original alist longer than necessary. But that is O(1)
allocation that
+;; could only rarely be avoided: `alist-delete` would allocate O(n)
pairs,
+;; which would only be necessary in the worst case.
+(define (alist-delete* alist key)
+  "Return an assosciation list like (and potentially sharing storage
with)
+ALIST, but with no entry for KEY."
+  (define-values (_popped remaining)
+    (alist-pop alist key))
+  remaining)
That's a pretty long comment around something that could be done with
call-with-values or SRFI-71 let.  I think one of these two should be
preferred.

Note that both our versions of alist-pop only pop the first key (as
they should).  This means that alist-delete* should really be called
alist-delete-1 as in "remove the first pair in ALIST belonging to KEY".
For the larger JSON handling below, this makes no difference however.

Here I was using '*' to mean "a slightly altered version of", as with 'letrec' and 'letrec*', but, yes, since the other functions defined here use '*' to mean "zero or more times", the name is confusing: I think I'd just call it 'alist-delete' and not import (srfi srfi-1)'s version.

The comment may be unnecessarily long ... the essence of what I was trying to explain is that, in all of these implementations, I've tried to avoid unnecessary allocation. Being able to rely on 'alist-delete', and more generally 'alist-pop', not to needlessly copy the "spine" of the list lets later functions use them unconditionally.

Why would you prefer 'call-with-values' or SRFI-71 over 'define-values'? The style guide against which I'm used to working [3] generally prefers internal definitions, to avoid rightward drift.

[3]: https://docs.racket-lang.org/style/Choosing_the_Right_Construct.html#%28part._.Definitions%29

+(define (alist-set alist key value)
+  "Return an assosciation list like ALIST, but with KEY mapped to
VALUE,
+replacing any existing mapping for KEY."
+  (acons key value (alist-delete* alist key)))
Is order relevant here?  Because we could just as well reimplement our
alist-delete* loop and cons the replacement onto the rest.  WDYT?

Relying on order for JSON objects is non-interoperable, per RFC 8259 § 4. I'm not intending for these alist procedures to be exported, so I'm not trying to handle any more general case than that, as I explain in the comments at the top of the file.

I'm not sure what the advantage would be to reimplementing the 'alist-delete' loop here.


+(define (jsobject-set js key value)
+  "Return a json object like JS, but with KEY mapped to VALUE,
replacing any
+existing mapping for KEY."
+  (cons '@ (match js
+             (('@ . alist)
+              (alist-set alist key value)))))
I think it'd be wiser to put the cons inside the match.


I don't care very much either way.

+(define jsobject-set*
+  (case-lambda
+    "Return a json object like JS, but functionally extended by
mapping each
+KEY to each VALUE, replacing any existing mapping for each KEY.  The
update
+takes place from left to right, so later mappings overwrite earlier
mappings
+for the same KEY."
+    ((js)
+     js)
+    ((js key value)
+     (jsobject-set js key value))
+    ((js . args)
+     (cons '@ (match js
+                (('@ . alist)
+                 (let loop ((alist alist)
+                            (args args))
+                   (match args
+                     (()
+                     alist)
+                     ((key value . args)
+                      (loop (alist-set alist key value)
+                            args))))))))))
I'm not sure if I like this "syntax".  I think I'd prefer
   (jsobject-set* obj (FIELD1 VALUE1) (FIELD2 VALUE2) ...)
with FIELD1, FIELD2 being identifiers
WDYT?

So you would make 'jsobject-set*' a macro? When you say, "with FIELD1, FIELD2 being identifiers", do you mean that the macro should convert them to strings at compile-time? While, if I were designing a JSON representation, I'd much prefer to use symbols for the object keys, I think it would be confusing to use strings everywhere else but magic symbols here.

I based this function on 'hash-set*' [4], 'dict-set*' [5], and numerous similar functions in the Racket world, so I have a high degree of confidence in the usability of the interface.

[4]: https://docs.racket-lang.org/reference/hashtables.html#%28def._%28%28lib._racket%2Fprivate%2Fbase..rkt%29._hash-set%2A%29%29 [5]: https://docs.racket-lang.org/reference/dicts.html#%28def._%28%28lib._racket%2Fdict..rkt%29._dict-set%2A%29%29


+(define (alist-update alist key failure-result updater)
+  "Return an assosciation list like ALIST, but with KEY mapped to
the result
+of applying UPDATER to the value to which KEY is mapped in ALIST.
When ALIST
+does not have an existing mapping for KEY, FAILURE-RESULT is used as
with
+'jsobject-ref' to obtain the argument for UPDATER."
+  ;; Often, `updater` will be a lambda expression, so making it the
last
+  ;; argument may help to makes the code legible, and the most
likely
+  ;; `failure-result` arguments are all shorter than the keyword
+  ;; `#:failure-result`.  Plus, making `failure-result` mandatory
helps make
+  ;; `alist-update` consistent with `alist-update*`.
Which alist-update* are you referring to here?  Either way, the
failure-result to default argument from above applies, but we could
keyword it.

Ah, I guess read that as, "Plus, making 'default' mandatory helps make 'jsobject-update' consistent with 'jsobject-update*'."

+  (define-values (popped tail-alist)
+    (alist-pop alist key))
+  (acons key
+         (updater (match popped
+                    (#f
+                     (if (procedure? failure-result)
+                         (failure-result)
+                         failure-result))
+                    ((_ . value)
+                     value)))
+         tail-alist))
SRFI-71 let says hi.  Also the ordering question applies.  I'm starting
to think we should implement alist-pop, alist-set and alist-update in
terms of a single more powerful function producing three values (or
SRFI-1 span).

Same question again re 'define-values'.

My intent in creating 'alist-pop' was to have a primitive that would work for both 'alist-update' and 'alist-delete', and thereby 'alist-set'. Returning the prefix and the tail separately would involve either extra allocation or mutating pairs. Since order never matters in this context, why pay that price?

+(define* (jsobject-union #:key
+                         (combine (lambda (a b) b))
+                         (combine/key (lambda (k a b) (combine a
b)))
+                         #:rest json-objects)
+  "Combine the given JSON-OBJECTS into a single json object.  The
JSON-OBJECTS
+are merged from left to right by adding each key/value pair of each
object to
+the aggregate object in turn.  When one of the JSON-OBJECTS contains
a mapping
+from some key KEY to a value VAL such that the aggregate object
already
+contains a mapping from KEY to a value VAL0, the aggregate object is
+functionally updated to instead map KEY to the value of (COMBINE/KEY
KEY VAL0
+VAL).  The default COMBINE/KEY tail-calls (COMBINE VAL0 VAL), and
the default
+COMBINE simply returns its second argument, so, by default, mappings
in later
+JSON-OBJECTS supersede those in earlier ones."
+  (match (filter (lambda (v)
+                   (not (or (keyword? v)
+                            (procedure? v))))
+                 json-objects)
+    (()
+     '(@))
+    (((and js0 ('@ . _)))
+     js0)
+    ((('@ . alist0) ('@ . alist*) ...)
+     (cons '@ (fold (lambda (alist1 alist0)
+                      (if (null? alist0)
+                          alist1
+                          (fold (lambda (k+v alist0)
+                                  (match k+v
+                                    ((k . v)
+                                     (define-values (popped tail-
alist)
+                                       (alist-pop alist0 k))
+                                     (match popped
+                                       (#f
+                                        (cons k+v tail-alist))
+                                       ((_ . v0)
+                                        (acons k
+                                               (combine/key k v0 v)
+                                               tail-alist))))))
+                                alist0
+                                alist1)))
+                    alist0
+                    alist*)))))
Same default argument.  Cons inside.
I think having a single combine function taking (k a b) would be less
confusing than having two.  Is there a rationale for the form you
chose?

I based this function in particular on 'hash-union' from 'racket/hash' [6], which uses these keywords. (But in 'hash-union', collisions trigger an exception by default, and it requires at least one argument, because otherwise it would be unclear what key-comparison function the result should use.)

Having '#:combine' in addition to '#:combine/key' is ultimately just a convenience, but it is quite a nice convenience in practice. It is quite rare, in my experience, for a '#:combine' function to actually depend on the key: it might depend on the type of the value, but, very often, it unconditionally applies to all keys. Using '#:combine' is particularly nice when using a combination function that already exists, like 'append' or '+'.

[6]: https://docs.racket-lang.org/reference/hashtables.html#%28def._%28%28lib._racket%2Fhash..rkt%29._hash-union%29%29

+  (with-atomic-json-file-replacement "package.json"
+    (lambda (pkg-meta)
+      (jsobject-update*
+       pkg-meta
+       "devDependencies" '(@) resolve-dependencies
+       "dependencies" '(@) (lambda (deps)
+                             (resolve-dependencies
+                              (jsobject-union
+                               (jsobject-ref pkg-meta
"peerDependencies" '(@))
+                               deps))))))
    #t)
We should probably add a function to our js utils that "generates an
empty object", because '(@) is quite confusing to see in these
circumstances.  Otherwise LGTM with the aforementioned caveats.

I'm not sure what to call it: it would have to be short, or people (me, at least) might end up writing '(@) anyway. Also, IIUC Guile doesn't actually prevent you from mutating quoted constant pairs, so I function would have to allocate a fresh pair to be robust.

It's a somewhat odd idea, but how about this?

    (define-syntax |{}| (identifier-syntax '(@)))

It's as short as '(@), it looks like the JSON notation for the empty object, and IIUC people could only use it to mess up occurrences of '(@) within the same compilation unit, which we can't stop them from doing anyway.

Alternatively, if we give up the thunk special case for 'default' values, we could do:

    (define jsobject-update
      (case-lambda
        ((js key updater)
         (jsobject-update js key '(@) updater))
        ((js key default updater)
         ...)))
    (define jsobject-update*
      (case-lambda
        ...
        ((js . args)
         (match js
           (('@ . alist)
            (cons '@ (let loop ((alist alist)
                                (args args))
                       (match args
                         (()
                          alist)
                         ((key (? procedure? updater) . args)
                          (loop (alist-update alist key '(@) updater)
                                args))
                         ((key default updater . args)
                          (loop (alist-update alist key '(@) updater)
                                args))))))))))

-Philip





reply via email to

[Prev in Thread] Current Thread [Next in Thread]