guix-commits

01/03: doc: Add infra-handbook.org.


From: Maxim Cournoyer
Subject: 01/03: doc: Add infra-handbook.org.
Date: Fri, 30 Sep 2022 17:02:37 -0400 (EDT)

apteryx pushed a commit to branch master
in repository maintenance.

commit 62525c6d4986e8abf3c7f23d41ecaa9b053c7d6e
Author: Maxim Cournoyer <maxim.cournoyer@gmail.com>
AuthorDate: Wed Sep 21 11:37:52 2022 -0400

    doc: Add infra-handbook.org.
    
    * doc/infra-handbook.org: New file.
---
 doc/infra-handbook.org | 193 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 193 insertions(+)

diff --git a/doc/infra-handbook.org b/doc/infra-handbook.org
new file mode 100644
index 0000000..956173d
--- /dev/null
+++ b/doc/infra-handbook.org
@@ -0,0 +1,193 @@
+#+TITLE: Guix Infrastructure Handbook
+
+This handbook is intended for sysadmin volunteers taking care of the
+infrastructure powering the Guix website, substitutes and other
+services offered via https://guix.gnu.org/.
+
+The different machines involved are registered in the
+file:../hydra/machines.rec file.
+
+* Berlin
+Berlin is the main machine: it hosts the website
+(https://guix.gnu.org/) and the MUMI issue tracker
+(https://issues.guix.gnu.org/), runs the build farm
+(https://ci.guix.gnu.org/), and serves the cached substitutes.  It is
+graciously provided by the Max Delbrück Center for Molecular Medicine
+in the Helmholtz Association (MDC) and hosted at their data center in
+Berlin, hence its name.
+
+** Specifications
+
+Dell PowerEdge R7425 server with the following specifications:
+
+- 2x AMD EPYC 7451 24-Core processors
+- Storage Area Network (SAN) of 100 TiB
+- 188 GiB of memory
+
+The machine can be remotely administered via iDRAC, the Dell server
+management platform.
+
+Its configuration is defined in file:../hydra/berlin.scm.  A second
+machine, known as node 129, is intended to become a fallback for
+Berlin; it is deployed from Berlin via the deploy file
+file:../hydra/deploy-node-129.scm.
+
+** SSH access to Berlin and node 129
+
+The following ~~/.ssh/config~ snippet can be used to access the
+Berlin machine:
+
+#+begin_src
+Host berlin
+     HostName berlin.guix.gnu.org
+     DynamicForward 8022
+     ForwardAgent yes
+#+end_src
+
+The ~DynamicForward~ on port 8022 is explained in the iDRAC web
+access section below, while ~ForwardAgent~ makes your SSH agent
+credentials available on Berlin, which is needed to deploy node 129
+from there.
+
+For node 129, you can use:
+#+begin_src
+Host hydra-guix-129
+     HostName 141.80.181.41
+     DynamicForward 8022
+#+end_src
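As a sanity check, OpenSSH can show how such a stanza is resolved without actually connecting, via ~ssh -G~.  A minimal sketch, assuming a POSIX shell and an OpenSSH client (the temporary file is illustrative):

```shell
# Write the Berlin stanza from above into a throw-away config file.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
Host berlin
     HostName berlin.guix.gnu.org
     DynamicForward 8022
     ForwardAgent yes
EOF

# 'ssh -G' prints the configuration ssh would use, without connecting.
ssh -G -F "$cfg" berlin | grep -Ei '^(hostname|dynamicforward|forwardagent) '
rm -f "$cfg"
```

This is a convenient way to verify that the ~HostName~, ~DynamicForward~ and ~ForwardAgent~ directives take effect before attempting a real connection.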
+
+** iDRAC web page access
+
+The Dell iDRAC management suite offers a web interface for common
+actions such as rebooting a machine, changing parameters or simply
+checking its current status.  The iDRAC page of Berlin can be
+accessed at https://141.80.167.225, while node 129's page can be
+accessed at https://141.80.167.229.  Because the iDRAC web interface
+can only be accessed locally from the MDC network, it is necessary
+to configure an HTTP proxy.  This can be accomplished via OpenSSH's
+SOCKS proxy support.  For it to work, two things are needed:
+
+1. A ~DynamicForward~ directive on your SSH host, as shown in the
+   snippets from the above [[SSH access to Berlin and node 129]] section.
+2. A proxy auto-configuration (PAC) file to configure your browser to
+   relay requests for specific addresses through the SOCKS proxy.
+
+For GNU IceCat, the PAC file can be defined as below, and placed for
+example at ~~/.mozilla/proxy.pac~.  Then navigate to the IceCat
+Settings -> General -> Network Settings (at the very bottom), tick
+the "Automatic proxy configuration URL" option, and input the PAC
+file URI in the associated text box, e.g.:
+file:///home/maxim/.mozilla/proxy.pac.  Click the "Reload" button to
+apply it.
+
+#+begin_src
+function FindProxyForURL(url, host) {
+    if (isInNet(dnsResolve(host), "141.80.167.0", "255.255.255.0")) {
+        return "SOCKS localhost:8022; DIRECT";
+    } else {
+        return "DIRECT";
+    }
+}
+#+end_src
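The ~isInNet~ check above matches any host in the 141.80.167.0/24 subnet, i.e. all the iDRAC addresses.  The same test can be sketched with plain shell arithmetic (the helper names below are illustrative, not part of the PAC standard):

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Mirror the PAC file's isInNet(host, "141.80.167.0", "255.255.255.0").
in_idrac_net() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int 141.80.167.0)
    mask=$(ip_to_int 255.255.255.0)
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# iDRAC of node 129 -> prints "SOCKS localhost:8022"
in_idrac_net 141.80.167.229 && echo "SOCKS localhost:8022" || echo "DIRECT"
# node 129 itself (not an iDRAC address) -> prints "DIRECT"
in_idrac_net 141.80.181.41 && echo "SOCKS localhost:8022" || echo "DIRECT"
```

This makes it easy to check which addresses the browser will send through the SOCKS proxy before touching the PAC file.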
+
+After that, navigating to https://141.80.167.229 should display the
+iDRAC login page, as long as you have an active connection to either
+~berlin~ or ~hydra-guix-129~.
+
+** iDRAC serial console access to Berlin
+
+iDRAC also provides access to a server's serial console, which can be
+very handy to debug boot problems (before an SSH server is
+available).  The iDRAC management interfaces are only reachable at
+IPs private to the MDC network, so it is necessary to proxy jump
+through Berlin or node 129 to reach them, as shown in the
+~~/.ssh/config~ configuration snippets below:
+
+#+begin_src
+Host hydra-guix-129-idrac
+     ProxyJump berlin
+     HostName 141.80.167.229
+     User guix
+
+Host berlin-idrac
+     ProxyJump hydra-guix-129
+     HostName 141.80.167.225
+     User guix
+#+end_src
+
+Note that we don't proxy jump through berlin itself to access its
+iDRAC interface, because this wouldn't work when berlin is down.  For
+the same reason, the iDRAC interface of node 129 is reached by proxy
+jumping through berlin.
+
+** Repairing a non-bootable Guix System via a PXE booted image
+
+One way to fix a non-bootable Guix System is to boot a different
+GNU/Linux system, mount the partitions and make changes to them.
+This is made possible for Berlin and node 129 by having their boot
+mode fall back to a network (PXE) boot, and using the serial console
+to navigate the boot menus.  The images are made available by the
+MDC infrastructure team via [[https://github.com/cobbler/cobbler][Cobbler]], and only a few of the
+available images are bootable (sadly, Guix System is not one of
+them).
+
+One image which works and has Btrfs support is
+"Ubuntu-22.04-server-amd64", but you need to adjust its kernel
+arguments at boot to add ~console=ttyS0,115200~ in order to see the
+serial output.  The installer screen offers a convenient way to turn
+on SSH, which you can then connect to from the ~hydra-guix-129~
+machine.
+
+You can then mount the file systems and modify ~/boot/grub/grub.cfg~
+or anything else needed.  If you need to reconfigure the machine,
+you can refer to info:guix#Chrooting to chroot into the existing
+system, except you'll need to pass the ~--no-substitutes~ argument
+to ~guix-daemon~, otherwise it'll loop trying to fetch substitutes
+from https://ci.guix.gnu.org, in vain.  If the reconfiguration
+hangs, you may need to use ~--no-grafts~.
+
+* Btrfs file system
+
+Btrfs is the current file system of choice for GNU/Linux-based Guix
+System build machines: it is not susceptible to the ext4 inode
+exhaustion problem, and its zstd compression can almost double the
+effective capacity of a storage device at little computational cost.
+
+** Btrfs compression and mount options
+
+To get the most out of Btrfs, enabling zstd compression is
+recommended.  When using RAID arrays, it can also be useful to add
+the ~degraded~ mount option, otherwise the RAID could fail to
+assemble at boot if any drive in the array has a problem.  Here's an
+alist of recommended mount options, taken from
+file:../hydra/deploy-node-129.scm, for a build machine where high
+availability is preferred over data safety (hence ~degraded~):
+
+#+begin_src scheme
+(define %common-btrfs-options '(("compress-force" . "zstd")
+                                ("space_cache" . "v2")
+                                "degraded"))
+#+end_src
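Rendered as a mount(8) option string, the alist above corresponds to the following; a sketch, where the device and mount point names are hypothetical:

```shell
#!/bin/sh
# Pairs in the alist become key=value options; bare strings such as
# "degraded" are used as-is.
opts="compress-force=zstd,space_cache=v2,degraded"
echo "$opts"

# On the actual machine this would be used (as root) along the lines of:
#   mount -t btrfs -o "$opts" /dev/sda2 /mnt   # device name is illustrative
```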
+
+** Btrfs balance mcron job
+
+To keep the file system operating without manual intervention, a
+balance job should run periodically so that the unallocated space (a
+Btrfs-specific concept) remains in line with the actual free space.
+Otherwise, the system could report ~ENOSPC~ even when common
+utilities such as ~df -h~ report plenty of free space.  To view the
+amount of available unallocated space, the ~btrfs filesystem usage
+/~ command can be used.
+
+The following mcron job example is taken from the
+file:../hydra/deploy-node-129.scm machine configuration:
+
+#+begin_src scheme
+(define btrfs-balance-job
+  ;; Re-allocate chunks which are using less than 5% of their chunk
+  ;; space, to regain Btrfs 'unallocated' space.  The usage is kept
+  ;; low (5%) to minimize wear on the SSD.  Runs at 5 AM every 3 days.
+  #~(job '(next-hour-from (next-day (range 1 31 3)) '(5))
+         (lambda ()
+           (execl #$(file-append btrfs-progs "/bin/btrfs") "btrfs"
+                  "balance" "start" "-dusage=5" "/"))
+         "btrfs-balance"))
+#+end_src
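The ~(range 1 31 3)~ day specification above selects every third day of the month, and ~(next-hour-from ... '(5))~ makes the job fire at 05:00 on those days.  Assuming mcron's ~range~ excludes its upper bound, the resulting day list can be sketched with ~seq~:

```shell
# Days of the month on which the balance job fires at 05:00:
# every third day starting from the 1st, upper bound excluded.
seq 1 3 30 | tr '\n' ' '; echo
# -> 1 4 7 10 13 16 19 22 25 28
```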


