|
From: | Maxim Cournoyer |
Subject: | bug#52051: [core-updates-frozen] cannot login ('org.freedesktop.login1' service times out) |
Date: | Fri, 10 Dec 2021 00:09:28 -0500 |
User-agent: | Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux) |
Hello Ludovic! Ludovic Courtès <ludo@gnu.org> writes: > Hello! > > Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis: > >> 374 connect(11, {sa_family=AF_UNIX, >> sun_path="/var/run/dbus/system_bus_socket"}, 34) = 0 > > [...] > >> 374 epoll_wait(5, [{events=EPOLLIN|EPOLLOUT|EPOLLHUP, data={u32=24802800, >> u64=24802800}}], 20, -1) = 1 >> 374 sendmsg(11, {msg_name=NULL, msg_namelen=0, >> msg_iov=[{iov_base="l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0", >> iov_len=128}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, >> MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe) >> 374 gettid() = 374 >> 374 epoll_ctl(5, EPOLL_CTL_MOD, 11, {events=0, data={u32=24802800, >> u64=24802800}}) = 0 >> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, >> tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0 >> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, >> {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2 >> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8 >> 374 gettid() = 374 >> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, >> tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0 >> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, >> {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2 >> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8 >> 374 gettid() = 374 >> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, >> tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0 >> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, >> {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2 >> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8 >> 374 epoll_ctl(5, EPOLL_CTL_DEL, 11, NULL) = 0 >> 374 close(11) = 0 >> 374 gettid() = 374 >> 374 epoll_wait(5, <unfinished ...> >> 391 <... close resumed>) = 0 >> 391 madvise(0x7fd6c83dc000, 8368128, MADV_DONTNEED) = 0 >> 391 exit(0) = ? >> 391 +++ exited with 0 +++ >> 374 <... epoll_wait resumed>[{events=EPOLLERR, data={u32=24768000, >> u64=24768000}}], 17, -1) = 1 >> 374 lseek(7, 0, SEEK_SET) = 0 >> 374 read(7, "tty7\n", 63) = 5 > > As you pointed out on IRC, the initially ‘Hello’ method call above leads > to EPIPE, and we can see that elogind eventually closes its socket to > dbus-daemon *but* keeps doing its thing. > > Some interesting things to note… Indeed, very interesting :-). Thanks for diving in! > First, to my surprise, elogind does not use the client library of the > ‘dbus’ package: > > $ guix gc --references $(./pre-inst-env guix build elogind)|grep dbus > $ echo $? > 1 > > > (This is already the case in ‘master’ with v243.7.) Instead, it has its > own implementation of the DBus protocol, in C, from systemd—we can’t > have enough sources of bugs and vulnerabilities. Oh! Marius found this interesting issue [0] that they shared on IRC today; I wonder if it may be related. sd-bus apparently pipeline things aggressively, which is not always handled well by other D-bus participants. [0] https://github.com/systemd/systemd/issues/16610 > Anyway, the “Hello” message is sent to the system bus asynchronously in > ‘sd-bus.c’: > > static int bus_send_hello(sd_bus *bus) { > _cleanup_(sd_bus_message_unrefp) sd_bus_message *m = NULL; > int r; > > assert(bus); > > if (!bus->bus_client) > return 0; > > r = sd_bus_message_new_method_call( > bus, > &m, > "org.freedesktop.DBus", > "/org/freedesktop/DBus", > "org.freedesktop.DBus", > "Hello"); > if (r < 0) > return r; > > return sd_bus_call_async(bus, NULL, m, hello_callback, NULL, 0); > } > > > A callback is called when a reply is received or an error arises: > > static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error > *error) { > > [...] > > fail: > /* When Hello() failed, let's propagate this in two ways: first we > return the error immediately here, > * which is the propagated up towards the event loop. Let's also > invalidate the connection, so that > * if the user then calls back into us again we won't wait any > longer. */ > > bus_set_state(bus, BUS_CLOSING); > return r; > } > > > It’s not clear from that whether the authors intended for the thing to > keep going in case of failure. In our case it’s not helpful. If we picture this in the systemd use case, I believe sd-bus must be used as *the* d-bus daemon, perhaps? So it should never die, and expect users to call back into it to retry things? In our case, it acts as a D-Bus client, not a server (IIUC), so perhaps its behavior is not tuned right for our use case. > But why does dbus-daemon drop the connection in the first place? > > To know that, we could change ‘dbus-root-service-type’ to run > dbus-daemon from a ‘--enable-verbose-mode’ build, and with the > ‘DBUS_VERBOSE’ environment set to 1. > > Looking at ‘dbus-server-socket.c’ it would seem that the only sensible > way this can happen is if dbus-daemon doesn’t yet have a > ‘new_connection_function’ set at the time it accepts the incoming > connection: Interesting that you mention this; that's what I worked on yesterday (see attached patches). I managed to get elogind panicking the kernel during a guix system reconfigure, which corrupted GRUB, so had to chroot and reconfigure from there [1]. Not sure what happened, but it seems that killing and restarting elogind is susceptible to cause hard locks. [1] https://lists.gnu.org/archive/html/guix-devel/2018-06/msg00202.html
0001-gnu-Add-dbus-verbose.patch
Description: Text Data
0002-services-dbus-Add-a-VERBOSE-configuration-option.patch
Description: Text Data
0003-build-Move-D-Bus-parser-to-its-own-build-module.patch
Description: Text Data
0004-services-telephony-Factorize-bus-polling-logic.patch
Description: Text Data
0005-services-dbus-Add-synchronization-primitives-to-dbus.patch
Description: Text Data
> static dbus_bool_t > handle_new_client_fd_and_unlock (DBusServer *server, > DBusSocket client_fd) > { > > [...] > > /* See if someone wants to handle this new connection, self-referencing > * for paranoia. > */ > new_connection_function = server->new_connection_function; > new_connection_data = server->new_connection_data; > > _dbus_server_ref_unlocked (server); > SERVER_UNLOCK (server); > > if (new_connection_function) > { > (* new_connection_function) (server, connection, > new_connection_data); > } > dbus_server_unref (server); > > /* If no one grabbed a reference, the connection will die. */ > _dbus_connection_close_if_only_one_ref (connection); > dbus_connection_unref (connection); > > We know that elogind is started after dbus-daemon has written its PID > file (there’s a Shepherd service dependency). Is there a possibility > that dbus-daemon writes its PID file before it has set > ‘new_connection_function’? Are PID files conventionally agreed to be synchronization primitives? In the case of systemd, I feel this may be more of an afterthought, given the system is designed to rely on socket-based activation. Still just guessing though, and your analysis above seems to suggest otherwise. > It would seem that ‘bus_context_new’ writes the PID file after it has > called ‘dbus_server_set_new_connection_function’ via ‘setup_server’ via > ‘process_config_first_time_only’. > > So not sure what happens. > > That’s it for today. Thoughts? Thank you! I'm sorry for my lack of deep analysis here, I've spent most of my evening attempting to fix my system just to boot ^^'. I've at least managed to collect the following verbose D-Bus log (54 MiB uncompressed) which hopefully can shed some light onto how this failure came to be.
dbus-daemon.log.xz
Description: Binary data
Happy Friday! Maxim
[Prev in Thread] | Current Thread | [Next in Thread] |