guix-devel

Making the Guix installer resilient against transient network issues


From: raid5atemyhomework
Subject: Making the Guix installer resilient against transient network issues
Date: Fri, 19 Mar 2021 03:09:54 +0000

Hello Guix devel,

When `guix system init` fails, there are a number of possible causes of failure:

* Packages being downloaded are so broken that they cannot actually be built.
  * QC should filter this out.
* The hardware being installed on is broken, usually a failure of the storage 
device being installed into.
* Downloading substitutes from the substitute server failed.

Of the above, the last is the most likely to occur in practice.
I have been doing a number of repeated installation tests on VMs using the 
SJTUG mirror server, as well as the Berlin Cuirass, and a significant number of 
installation attempts via the guided installer fail due to problems with 
downloading substitutes.

* From my system, the Berlin Cuirass server is very slow (< 40 kiB/s,
sometimes as low as 4 kiB/s).  Possibly because of that slowness, downloads
get interrupted partway through, which causes the install to fail.
* The SJTUG server sometimes responds in ways that the Guix downloader does not 
expect, causing failures.

What I do instead is to use the "manual" mode and just keep doing `guix system 
build` over and over until it manages to pull through.
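
As a rough sketch, that manual workaround amounts to a shell loop like the one below.  The helper name `retry_forever`, the `RETRY_DELAY` variable, and the 3-second default pause are assumptions made for illustration; the configuration path in the usage comment is only an example.

```shell
#!/bin/sh
# Sketch: re-run a command until it succeeds.
# `retry_forever` and RETRY_DELAY are hypothetical names, not part of Guix.
retry_forever() {
    attempt=1
    until "$@"; do
        echo "Attempt $attempt failed; retrying in ${RETRY_DELAY:-3}s..." >&2
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-3}"
    done
    echo "Succeeded after $attempt attempt(s)." >&2
}

# Usage (the path is illustrative):
#   retry_forever guix system build --fallback /mnt/etc/config.scm
```

Note that this loop never gives up, so it only makes sense when someone is watching and can interrupt it if the failure turns out to be permanent.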

I think that the guided installer should also use the same technique of trying 
`guix system build` repeatedly for at least some number of tries, possibly 
asking the user if they want to keep trying (in case the issue is a permanent 
network error rather than a transient network error).
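
To make the intended behaviour concrete, here is a minimal shell sketch of "retry automatically a few times, then ask".  It is not how the installer is implemented (the installer is written in Scheme); `retry_then_ask`, `MAX_AUTO_RETRIES`, and `RETRY_DELAY` are hypothetical names, and the defaults are arbitrary.

```shell
#!/bin/sh
# Sketch: retry a command automatically up to MAX_AUTO_RETRIES times,
# then ask the user before every further attempt.
# `retry_then_ask`, MAX_AUTO_RETRIES and RETRY_DELAY are hypothetical names.
retry_then_ask() {
    tries=0
    while ! "$@"; do
        tries=$((tries + 1))
        if [ "$tries" -le "${MAX_AUTO_RETRIES:-3}" ]; then
            echo "Build failed; retrying in ${RETRY_DELAY:-3}s..." >&2
            sleep "${RETRY_DELAY:-3}"
        else
            printf 'Build failed again.  Retry? [y/N] ' >&2
            read -r answer || answer=n   # end of input counts as "no"
            case $answer in
                [Yy]*) ;;                # user asked for another attempt
                *) return 1 ;;           # give up: possibly a permanent error
            esac
        fi
    done
    return 0
}

# Usage (the path is illustrative):
#   retry_then_ask guix system build --fallback /mnt/etc/config.scm
```

The automatic retries cover short outages without bothering the user, while the prompt keeps the loop from spinning forever when the network is really down.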

Yes, currently a failure to install "just" kicks the user back to the guided
install, and they can rerun `guix system init`.  ***HOWEVER***, because the
store is in a COW mode, this sometimes leaves the store in a wonky state, and
`guix system init` then either performs the system build from scratch or
fails outright.  Not to mention that this requires more keypresses from the
user.

So, let me sketch proposed changes to `gnu/installer/final.scm`:

```patch
@@ -169,6 +169,15 @@ or #f.  Return #t on success and #f on failure."
                                   "/tmp/installer-system-init-options"
                                 read))
                             (const '())))
+         (build-command   (append (list "guix" "system" "build"
+                                        "--fallback")
+                                  options
+                                  (list (%installer-configuration-file))))
+         (build-grub-command
+                          (append (list "guix" "build"
+                                        "--fallback"
+                                        "grub" "grub-efi")
+                                  options))
          (install-command (append (list "guix" "system" "init"
                                         "--fallback")
                                   options
@@ -178,6 +187,36 @@ or #f.  Return #t on success and #f on failure."
          (database-file   (string-append database-dir "/db.sqlite"))
          (saved-database  (string-append database-dir "/db.save"))
          (ret             #f))
+
+    (define* (perform-install #:optional (tries 0))
+
+      (define (retry)
+        (perform-install (+ tries 1)))
+
+      (define (ask-if-retry)
+        ;; TODO. Not sure best way to query user whether they
+        ;; would like to retry again.  Placeholder: never retry.
+        #f)
+
+      (if (and (run-command build-command #:locale locale)
+               (run-command build-grub-command #:locale locale))
+          (run-command install-command #:locale locale)
+          ;; Try to recover.
+          (begin
+            (format #t "~%~%~a~%~a~%~%"
+                    (G_ "Failure while building system.")
+                    (G_ "This is usually caused by (hopefully transient) network errors."))
+            (cond
+              ((< tries %max-auto-system-build-retries)
+               (format #t "~a~%"
+                       (G_ "Will wait 3 seconds and retry..."))
+               (sleep 3)
+               (retry))
+              (else
+               #f)))))
+
     (mkdir-p (%installer-target-dir))

     ;; We want to initialize user passwords but we don't want to store them in
@@ -221,9 +260,8 @@ or #f.  Return #t on success and #f on failure."
                          (lambda ()
                            (with-error-to-file "/dev/console"
                              (lambda ()
-                               (run-command install-command
-                                            #:locale locale)))))
-                       (run-command install-command #:locale locale))))
+                               (perform-install)))))
+                       (perform-install))))
            (lambda ()
              ;; Restart guix-daemon so that it does no keep the MNT namespace
              ;; alive.
```

Notes:

* `guix system build` only builds the *system*.  It doesn't build the
bootloader.  I can't find a command that builds the bootloader; only `guix
system init` or `guix system reconfigure` do that.  But we need to
distinguish the failure "downloading from the substituter failed" (which
might be fixable by just retrying) from the failure "writing to the device
being installed into failed".
  * In the above I use `guix build grub grub-efi` as a proxy for this, but it 
would be nice if there were some kind of `guix system build-bootloader` that 
would perform *building* of the script that installs the bootloader, but 
doesn't actually install the bootloader *yet*.
* I don't know how best to ask the user if they want to retry the system 
building process.
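
The distinction in the first note can be sketched in shell as two phases with different exit codes.  This is only an illustration under assumptions: `install_in_phases`, the `GUIX` override variable, and the exit codes 2 and 3 are made up here.  As noted above, a failure in the second phase is not guaranteed to be a device error, since the bootloader may still need to be downloaded at that point.

```shell
#!/bin/sh
# Sketch: run the build and the init as separate phases so the caller can
# tell them apart.  `install_in_phases`, GUIX and the exit codes 2/3 are
# hypothetical; they are not part of Guix.
install_in_phases() {
    config=$1
    target=$2
    guix_cmd=${GUIX:-guix}

    # Phase 1: download and build the system.  A failure here is a
    # candidate for retrying.
    if ! "$guix_cmd" system build --fallback "$config"; then
        echo "Build phase failed (possibly a transient network error)." >&2
        return 2
    fi

    # Phase 2: actually install onto the target.  Retrying blindly is
    # riskier here, because the target device itself may be at fault.
    if ! "$guix_cmd" system init --fallback "$config" "$target"; then
        echo "Init phase failed." >&2
        return 3
    fi
}

# Usage (paths are illustrative):
#   install_in_phases /mnt/etc/config.scm /mnt
```

A caller could then retry only when the exit code is 2 and report anything else to the user straight away.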


Thanks
raid5atemyhomework


