guix-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Identical files across subsequent package revisions


From: zimoun
Subject: Re: Identical files across subsequent package revisions
Date: Wed, 23 Dec 2020 11:19:33 +0100

Hi Ludo,

On Tue, 22 Dec 2020 at 23:01, Ludovic Courtès <ludo@gnu.org> wrote:

> I wanted to evaluate that by looking at store items corresponding to
> subsequent revisions of a package (be it different versions or rebuilds
> induced by dependencies), and this is what the program below does.
>
> Here are preliminary results for a few packages:

[...]

> The reason I’m looking at this is to understand how much would be gained
> in terms of bandwidth usage if we were able to avoid downloading
> individual files already in the store.  It would seem to be rather
> encouraging.

This is really interesting.  Especially through normal distribution and
co. lens.  The mean needs to be weighed: 10% of 180MB is not the same
than 55% of 1MB (icecat vs diffoscope, from “guix size” and I could
misread the numbers).  The variance between revisions is highly variable
(from 0 to significant).  The one between packages says how fat or fit
the normal distribution is.

A smarter distribution could lead to some bandwith saving.  The question
is: predict how many?  Is it worth to add a bit of complexity to save
corner cases or for all.

By corner cases, I mean that for example I am not updating my Emacs at
each revisions.

<https://data.guix.gnu.org/repository/1/branch/master/package/emacs/output-history>




> Below is the program that does that.  It grabs revision history from
> data.guix.gnu.org, fetches nars from ci.guix.gnu.org, computes a
> “digest” (list of files along with their hash and size), compares
> package digests pairwise, and plots the result with Guile-Charting.
> Example REPL session:
>
> --8<---------------cut here---------------start------------->8---
> scheme@(similarities)>  (pairwise-similarities (package-archive-contents 
> "icecat" #:max 4))
> updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
> $86 = (17363915/196387883 11380615/98152193)
> scheme@(similarities)> (map exact->inexact $86)
> $87 = (0.08841642740249916 0.11594865740799087)
>
> […]
>
> scheme@(similarities)> ,pp (at-most 10 (package-instances "r-minimal") )
> $100 = (#<<package-instance> version: "4.0.3" output: 
> "/gnu/store/vv3ca1r5zw5y35xgkix4r80hdnncx52b-r-minimal-4.0.3">
>  #<<package-instance> version: "4.0.3" output: 
> "/gnu/store/5dzad7nhhv3dvmap60d6gsj9ppflgzrd-r-minimal-4.0.3">
>  #<<package-instance> version: "4.0.3" output: 
> "/gnu/store/01xi3sig314wgwa1j9sxk37vl816mj74-r-minimal-4.0.3">
>  #<<package-instance> version: "4.0.2" output: 
> "/gnu/store/nv7lqhnm0mncqwdpkkhnlsgb562lcwff-r-minimal-4.0.2">
>  #<<package-instance> version: "4.0.2" output: 
> "/gnu/store/w0izbm8q26dmyndhv46xr7dgz1irai1z-r-minimal-4.0.2">
>  #<<package-instance> version: "4.0.2" output: 
> "/gnu/store/yd83ibzxjrb7cgcc6d4smx4pqcdl8r3p-r-minimal-4.0.2">
>  #<<package-instance> version: "4.0.1" output: 
> "/gnu/store/kpdh0lwlwcwfmmfzqhwbi6j7m4zzxlmn-r-minimal-4.0.1">
>  #<<package-instance> version: "4.0.1" output: 
> "/gnu/store/9x9nzzsiyn1q7g5myhgwjh0yx9js3nrj-r-minimal-4.0.1">
>  #<<package-instance> version: "4.0.0" output: 
> "/gnu/store/ahbm2gsqc3420a23pcwrxd4pdhl7rdpp-r-minimal-4.0.0">
>  #<<package-instance> version: "4.0.0" output: 
> "/gnu/store/0sgqhj2628js419wvw1vc1cw07wil7gr-r-minimal-4.0.0">)
> $101 = (#<<package-instance> version: "3.6.3" output: 
> "/gnu/store/gmx6p0wk3xbc9ylv83zfj855azgjxr0p-r-minimal-3.6.3">
>  #<<package-instance> version: "3.6.2" output: 
> "/gnu/store/dnb6fzp5295fcda66dnjk2y51mcva20f-r-minimal-3.6.2">
>  #<<package-instance> version: "3.6.1" output: 
> "/gnu/store/gd6sm42b6fr1qgyp6p1zp3z4aavxwyk2-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.1" output: 
> "/gnu/store/lpmfhxys3vsaqmqvj85r344ygfmlmlbg-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.1" output: 
> "/gnu/store/4413h13v7zrb7rp9lvyp2gfx2laj60wm-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.1" output: 
> "/gnu/store/zm5pl1y0wmh3c845498pbjvrzrm6sb07-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.1" output: 
> "/gnu/store/xrv7y4xgrdrsx5qba7144cpw69qc5f0j-r-minimal-3.6.1">
>  #<<package-instance> version: "3.6.0" output: 
> "/gnu/store/cbwhhqv69xi3k3g25dcfhwjkkf2427rp-r-minimal-3.6.0">
>  #<<package-instance> version: "3.6.0" output: 
> "/gnu/store/69k46yr70zkzzz9v18wl7sxasha5m0a5-r-minimal-3.6.0">
>  #<<package-instance> version: "3.6.0" output: 
> "/gnu/store/71w0383x0hay6ng1zaddz59by3g0gxaf-r-minimal-3.6.0">
>  #<<package-instance> version: "3.6.0" output: 
> "/gnu/store/m317mg8faxp9qn949dnv1xgsxyw57s3x-r-minimal-3.6.0">
>  #<<package-instance> version: "3.5.3" output: 
> "/gnu/store/33qdfplk4riig77vqg758m1zll16n6f0-r-minimal-3.5.3">
>  #<<package-instance> version: "3.5.3" output: 
> "/gnu/store/g8gkrcxn49yj8zjr59l7y4k7wgar5brq-r-minimal-3.5.3">
>  #<<package-instance> version: "3.5.1" output: 
> "/gnu/store/vrgbyvnx7b1sva2ij5a1gwrkbfwmg3lm-r-minimal-3.5.1">
>  #<<package-instance> version: "3.5.1" output: 
> "/gnu/store/4fzw7s0cv2zbixw4wb58zq6lkd97ghnf-r-minimal-3.5.1">
>  #<<package-instance> version: "3.5.1" output: 
> "/gnu/store/yb5048dr40nbmyfa1ar4hfiy3kd06v4c-r-minimal-3.5.1">)
> scheme@(similarities)> (similarity-chart '("icecat" "gimp" "openmpi" "emacs" 
> "diffoscope" "r-minimal") "/tmp/t.png" #:max 8)
> updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
> $102 = #<cairo-surface 7f502c7adb10>
> --8<---------------cut here---------------end--------------->8---
>
> Thoughts?  :-)
>
> Ludo’.
>
> ;;; Copyright © 2020 Ludovic Courtès <ludo@gnu.org>
> ;;; Hereby placed under the GNU General Public License, version 3 or later.
>
> (define-module (similarities)
>   #:use-module (json)
>   #:use-module ((gcrypt hash) #:select (open-sha256-port))
>   #:use-module ((guix store) #:select (%default-substitute-urls))
>   #:use-module ((guix progress) #:hide (dump-port*))
>   #:use-module (guix http-client)
>   #:use-module (guix serialization)
>   #:use-module (guix scripts substitute)
>   #:use-module (srfi srfi-1)
>   #:use-module (srfi srfi-9)
>   #:use-module (srfi srfi-11)
>   #:use-module (rnrs bytevectors)
>   #:use-module (charting)
>   #:use-module (ice-9 match))
>
>
> ;;;
> ;;; Data Service client.
> ;;;
>
> (define-json-mapping <package-instance> make-package-instance
>   package-instance?
>   json->package-instance
>   (version     package-instance-version)
>   (output      package-instance-output "derivation"))
>
> (define %data-service-base-url
>   (make-parameter "https://data.guix.gnu.org";))
>
> (define* (package-instances package #:key (branch "master"))
>   "Return a list of <package-instance> representing instances of PACKAGE over
> time known to the Data Service."
>   (match (assoc-ref (json->scm
>                      (http-fetch (string-append (%data-service-base-url)
>                                                 "/repository/1/branch/"
>                                                 branch "/package/" package
>                                                 "/output-history.json")))
>                     "derivations")
>     (#(lst ...)
>      (map json->package-instance lst))
>     (#f
>      #f)))
>
>
> ;;;
> ;;; Similarity measurement.
> ;;;
>
> (define (port-sha256* port size)               ;from (guix scripts challenge)
>   ;; Like 'port-sha256', but limited to SIZE bytes.
>   (let-values (((out get) (open-sha256-port)))
>     (dump-port* port out size)
>     (close-port out)
>     (get)))
>
> (define-record-type <file-info>
>   (file-info name type size sha256)
>   file-info?
>   (name   file-info-name)
>   (type   file-info-type)
>   (size   file-info-size)
>   (sha256 file-info-sha256))
>
> (define (archive-contents port)
>   "Return a list of <file-info> records from the nar read from PORT."
>   ;; As in (guix scripts challenge), but return <file-info> records that
>   ;; include file size and ignore symlinks.
>   (fold-archive (lambda (file type contents result)
>                   (match type
>                     ((or 'regular 'executable)
>                      (match contents
>                        ((port . size)
>                         (cons (file-info file type size
>                                          (port-sha256* port size))
>                               result))))
>                     ('directory result)
>                     ('directory-complete result)
>                     ('symlink result)))
>                 '()
>                 port
>                 ""))
>
> (define (narinfo-contents narinfo)             ;from (guix scripts challenge)
>   "Fetch the nar described by NARINFO and return a list representing the file
> it contains."
>   ((@@ (guix scripts challenge) call-with-nar) narinfo archive-contents))
>
> (define (at-most max-length lst)              ;from (guix scripts substitute)
>   "If LST is shorter than MAX-LENGTH, return it and the empty list; otherwise
> return its MAX-LENGTH first elements and its tail."
>   (let loop ((len 0)
>              (lst lst)
>              (result '()))
>     (match lst
>       (()
>        (values (reverse result) '()))
>       ((head . tail)
>        (if (>= len max-length)
>            (values (reverse result) lst)
>            (loop (+ 1 len) tail (cons head result)))))))
>
> (define* (package-archive-contents package
>                                    #:key (max 10)
>                                    (substitute-urls %default-substitute-urls))
>   "Look at the MAX latest instances of PACKAGE, fetch them, and return a
> summary of their contents as returned by 'narinfo-contents'."
>   (let ((instances (at-most max (package-instances package))))
>     (map narinfo-contents
>          (lookup-narinfos (first substitute-urls)
>                           (map package-instance-output instances)))))
>
> (define (similarity contents1 contents2)
>   "Return two values: the ratio of identical bytes between CONTENTS2 and
> CONTENTS2, and the ratio of identical files."
>   (define (matches name)
>     (lambda (info)
>       (string=? (file-info-name info) name)))
>
>   (let ((files (delete-duplicates
>                 (append (map file-info-name contents1)
>                         (map file-info-name contents2)))))
>     (let loop ((files files)
>                (seen 0)
>                (identical 0)
>                (seen-bytes 0)
>                (identical-bytes 0))
>       (match files
>         (()
>          (values (/ identical-bytes seen-bytes)
>                  (/ identical seen)))
>         ((head . tail)
>          (let ((file1 (find (matches head) contents1))
>                (file2 (find (matches head) contents2)))
>            (cond ((not file1)
>                   (loop tail
>                         (+ seen 1) identical
>                         (+ seen-bytes (file-info-size file2))
>                         identical-bytes))
>                  ((not file2)
>                   (loop tail
>                         (+ seen 1) identical
>                         (+ seen-bytes (file-info-size file1))
>                         identical-bytes))
>                  (else
>                   (let ((identical?
>                          (and (= (file-info-size file1)
>                                  (file-info-size file2))
>                               (bytevector=? (file-info-sha256 file1)
>                                             (file-info-sha256 file2)))))
>                     (loop tail
>                           (+ seen 1)
>                           (if identical?
>                               (+ identical 1)
>                               identical)
>                           (+ seen-bytes
>                              (max (file-info-size file1)
>                                   (file-info-size file2)))
>                           (if identical?
>                               (+ identical-bytes (file-info-size file1))
>                               identical-bytes)))))))))))
>
> (define (pairwise-similarities contents)
>   (let loop ((contents contents)
>              (similarities '()))
>     (match contents
>       ((or () (_))
>        (reverse similarities))
>       ((a b . tail)
>        (loop (cons b tail)
>              (cons (similarity a b) similarities))))))
>
> (define* (similarity-chart packages file
>                            #:key (max 20))
>   (make-bar-chart
>    "Similarity across subsequent package revisions"
>    (map (lambda (package)
>           (let* ((contents     (package-archive-contents package
>                                                          #:max max))
>                  (similarities (pairwise-similarities contents))
>                  (labels       (iota (length similarities))))
>             `(,package
>                ,@(map (lambda (label ratio)
>                         `(,(* ratio 100.) ,(number->string label)))
>                       labels similarities))))
>         packages)
>    #:chart-params '(#:x-axis-label "package"
>                     #:y-axis-label "similarity ratio (%)")
>    #:write-to-png file))



reply via email to

[Prev in Thread] Current Thread [Next in Thread]