Ticket #1052 (closed defect: fixed)

Opened 10 years ago

Last modified 9 years ago

cluster reboots sometimes hang (1)

Reported by: kaduk Owned by:
Priority: blocker Milestone: Precise Release
Component: -- Keywords:
Cc: Fixed in version:
Upstream bug:

Description

Sometimes, when a cluster machine decides it wants to reboot as a user is logging out, it will end up just spinning its dots in the Ubuntu shutdown splash screen.
Pressing 'del' to go to the console shows mostly standard shutdown messages about various daemons and such being killed by TERM, but also a few complaining about the session chroot not being able to be destroyed because it was still in use (from memory, may not be quite accurate).
Yet a three-finger salute brings the machine down and back, cleanly.

Change History

comment:1 Changed 10 years ago by jdreed

Either this or #1053 may be AFS-related. I had two machines in 1-115 today barf during shutdown of AFS with a call trace ("unable to handle kernel paging request"), and the stack trace involved waiting in an rx_read or something like that (I'm trying to get the machine ready for a class and don't have time to deal). 3-finger salute didn't work, sysRq did.

"Can we just go back to Lucid?"

comment:2 Changed 10 years ago by jdreed

Encountered this in the middle of a kexec reboot. I really think kexec is to blame here.

Please test reactivate in -dev now.

comment:3 follow-up: ↓ 4 Changed 10 years ago by jdreed

Stuart encountered a machine in Mccormick today stuck at the Ubuntu screen. DEL returned:

could not log bottup: address already in use
init: plymouth main process (5140) terminated with status 69
init: Failed to spawn setvtrgb main process: unable to change root directory: No such file or directory
init: Failed to spawn setvtrgb main process: unable to change root directory: No such file or directory
init: Failed to spawn setvtrgb main process: unable to change root directory: No such file or directory
init: Disconnected from system bus
init: dbus main process (1011) killed by TERM signal
init: statd main process (999) terminated with status 1
init: statd main process ended, respawning
init: rc main process (4916) killed by term signal
init: statd pre-start process (5284) killed by TERM signal
init: rc main process (4921) killed by TERM signal

Can we just solve all our problems by disabling Plymouth?

comment:4 in reply to: ↑ 3 Changed 10 years ago by kaduk

Replying to jdreed:

Stuart encountered a machine in Mccormick today stuck at the Ubuntu screen. DEL returned:

could not log bottup: address already in use
init: plymouth main process (5140) terminated with status 69
init: Failed to spawn setvtrgb main process: unable to change root directory: No such file or directory
init: Failed to spawn setvtrgb main process: unable to change root directory: No such file or directory
init: Failed to spawn setvtrgb main process: unable to change root directory: No such file or directory
init: Disconnected from system bus
init: dbus main process (1011) killed by TERM signal
init: statd main process (999) terminated with status 1
init: statd main process ended, respawning
init: rc main process (4916) killed by term signal
init: statd pre-start process (5284) killed by TERM signal
init: rc main process (4921) killed by TERM signal

I also saw this in w20-575 in my testing today; the three-finger salute just produced more such spew (but sysreq made it through). debathena-reactivate 2.0.25 worked fine on lola-granola, so I think we should push it out.

comment:5 Changed 10 years ago by jdreed

So, here's a slightly different failure mode that occurred today on m37-318-2. The machine was full of:
init: Failed to spawn plymouth-upstart-bridge main process: unable to change root directory: No such file or directory

The "unable to change root directory" part is intriguing.

The last few lines were:
Could not log bootup: Address already in use
and
init: plymouth main process (30745) terminated with status 69

The actual screen can be viewed at  http://mit.edu/jdreed/Public/debathena/m37-318-2-top.JPG (and m37-318-2-bottom.JPG in the same dir)

comment:6 Changed 10 years ago by kaduk

nw35-4 was frozen again today; I couldn't get to the console messages, and had to manually reboot.

comment:7 Changed 10 years ago by jdreed

OK, I saw this on 66-080-6 today. Logs and Nagios suggest that this reboot was initiated by reactivate. So I think that's still where we're losing, kexec or not. Dbus is clearly to partially to blame. The interesting thing, I think, would be knowing what the hell "status 69" means.

comment:8 follow-up: ↓ 9 Changed 10 years ago by jdreed

Anyone have any bright ideas here, short of ripping out all of Plymouth and Upstart? This happened to a machine in 4-237 and a faculty member complained.

comment:9 in reply to: ↑ 8 Changed 10 years ago by kaduk

Replying to jdreed:

Anyone have any bright ideas here, short of ripping out all of Plymouth and Upstart? This happened to a machine in 4-237 and a faculty member complained.

It's really been happening quite a lot, I think. But I don't have any ideas on it, and we don't seem to have been able to reproduce it in an environment where we can actually get debugging information.

comment:10 Changed 10 years ago by jdreed

  • Priority changed from normal to blocker

comment:11 Changed 10 years ago by jdreed

  • Milestone changed from The Distant Future to Fall 2011

comment:12 Changed 10 years ago by jdreed

We decided today that we should log as much as possible to local disk, and then find a machine that fails and look at it. There are no privacy concerns; if you're doing something bad in the cluster, don't name your script download-jstor.pl or something.

comment:13 follow-up: ↓ 14 Changed 9 years ago by jdreed

I wonder if we want to yank plymouth-theme-* from the clusters (if it's possible) so that we see normal startup and shutdown messages, in the case where 'DEL' is unresponsive.

Or is it easier to ship our own Plymouth theme (#145) which just displays all text messages?

comment:14 in reply to: ↑ 13 Changed 9 years ago by kaduk

Replying to jdreed:

I wonder if we want to yank plymouth-theme-* from the clusters (if it's possible) so that we see normal startup and shutdown messages, in the case where 'DEL' is unresponsive.

Or is it easier to ship our own Plymouth theme (#145) which just displays all text messages?

I would support doing either of these, though I'm not sure it's a very high priority.

comment:15 Changed 9 years ago by jdreed

OK, I dug into the upstart code and looked at a variety of launchpad bugs. I have a theory. This started with Natty, which coincided with when Upstart grew support for chroots (see  https://wiki.ubuntu.com/NattyNarwhal/TechnicalOverviewUpstart). I kind of wonder if the system bus is dying, getting respawned inside the chroot, and then dying when the chroot exits, and there's no bus with which to reboot. I wonder if fixing #775 will have any impact on this, and also I haven't see this failure mode on Precise. But if we do see it, I'd like to try disabling this support ("--no-sessions" on the command line) and see if that helps.

comment:16 Changed 9 years ago by jdreed

We may also want to wrap initctl inside the chroot to figure out what is actually getting called?

comment:17 Changed 9 years ago by jdreed

Encountered a borked quickstation-11-1 the other day. /var/log and some screen photos are in  /mit/jdreed/Public/qs-11-1

I don't have any bright ideas.

comment:18 Changed 9 years ago by jdreed

I don't have any bright ideas.

Oooh, but I do have a terrible one:

Index: reactivate/debian/reactivate
===================================================================
--- reactivate/debian/reactivate	(revision 25590)
+++ reactivate/debian/reactivate	(working copy)
@@ -54,5 +54,6 @@
 # perfectly good time to do so -- auto-update is inhibited during a
 # login session.
 if [ -e /var/run/reboot-required ]; then
+  sleep 30 && echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger &
   reboot
 fi
Last edited 9 years ago by jdreed (previous) (diff)

comment:19 Changed 9 years ago by jdreed

Some things are interesting about the qs-11-1 logs:
-modem-manager and wpa_supplicant were started inside the chroot, thanks to D-Bus doing... something. They are also the only processes (other than metrics, which hasn't yet been killed at this point, and probably should be -- see the fact that debathena-reactivate-cleanup runs _first_ in the postsession, not last) that stuck around inside the chroot. Even more confusing is how they got started, because this is an Optiplex 760, which lacks both a modem and a wireless card.

I didn't add the self-destruct call yet, it's easy enough to do later if we think it's a good (or at least "not terrible") idea. I did add some more logging, as well as the ability for snapshot-run to attempt to recover and end the session if something fails early enough. I wonder if it's worth having 16-killprocs-no-really go through its process killing twice, so if something survived the kill -9, we at least know that was attempted (as opposed to somehow missed, or worse, the process was restarted after it was killed).

comment:20 Changed 9 years ago by jdreed

  • Status changed from new to committed

Yay race conditions. Fixed, in theory, in r25609.

I did also see a machine today sitting at the text mode console with a bunch of AFS backtraces on the screen. I think this was a symptom, rather than a cause, of a failed shutdown. Once reactivate 2.0.35 is deployed, we should keep an eye out for that situation, and open a new ticket if we see it. But I'm optimistic that between this fix and #775, we will see a lot fewer reboots.

comment:21 Changed 9 years ago by jdreed

  • Status changed from committed to development

reactivate 2.0.36 is in -dev

comment:22 Changed 9 years ago by jdreed

  • Status changed from development to closed
  • Resolution set to fixed
Note: See TracTickets for help on using tickets.