Ticket #783 (closed enhancement: fixed)

Opened 13 years ago

Last modified 13 years ago

We need a recovery hook

Reported by: jdreed
Owned by: jdreed
Priority: normal
Milestone: The Distant Future
Component: --
Keywords:
Cc:
Fixed in version:
Upstream bug:


We need a somewhat reliable hook for when machines explode. This can be something like:

  • a script in AFS that gets sourced at boot time (and thus any failure is as simple as "reboot the machine")
  • a script in AFS that gets sourced by cron periodically
  • a script in AFS that gets sourced by auto-update (to fix things prior to an update).

This could potentially also minimize the need for us to do stupid version-specific things in maintainer scripts when we screw up.

Change History

comment:1 Changed 13 years ago by mitchb

I notice that all of the options Jon proposes involve AFS, which
seems high on the list of things that could be borked in an update
accident. A wget/curl would be a much more reliable thing in that
you don't need much of a system to do it, but has the authenticity
problem. https fetch of something from demeter?

comment:2 Changed 13 years ago by jdreed

Arguably, with a public root password, we'll always have an authenticity problem. If a user is going to go to the trouble of hijacking DNS and performing a MITM attack, then it would be trivial to replace any CA in the machine's keychain.

comment:3 Changed 13 years ago by mitchb

I don't see how that's at all the same issue. To use the root
password, you first have to either trick someone into running your
code or physically go compromise the machine. We're discussing
not getting duped into running an attacker's code here, in an
automated fashion, on an entire cluster.

comment:4 Changed 13 years ago by jdreed

Note that if we go the https route (compared to, say, http and a script signed with the Debathena PGP key), we should use a server with an Equifax cert, since that will work even if debathena-ssl-certificates somehow explodes.

comment:5 Changed 13 years ago by geofft

This seems way too complicated. We are depending on the recovery mode package to not be borked, so why not just stuff an extra copy of the CA into this package and run wget --ca-certificate?

(Honestly, if we want to be really sure about this, we'd statically compile a program against libcurl.)
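
For illustration, a minimal sketch of the wget approach suggested above. The function name and the example URL are hypothetical; the CA path assumes the certificate is shipped somewhere like /usr/share/debathena-auto-update, which is where the later patch on this ticket installs it.

```shell
#!/bin/sh
# Hypothetical helper: fetch URL to OUT, verifying the server certificate
# against the CA file passed as the third argument.
fetch_with_bundled_ca() {
    url=$1
    out=$2
    ca=$3
    # -q: quiet; -O: write the response to the given file;
    # --ca-certificate: CA file wget uses to verify the peer.
    wget -q -O "$out" --ca-certificate="$ca" "$url"
}

# Example call (URL and output path are placeholders, not real locations):
#   fetch_with_bundled_ca "https://example.invalid/update_hook" \
#       /var/run/update_hook /usr/share/debathena-auto-update/mitCA.crt
```

The function returns wget's exit status, so a caller can skip the hook whenever the fetch or the certificate check fails.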

comment:6 Changed 13 years ago by jdreed

So, I think the right thing to do here is probably to have auto-update pull something from demeter via https (verified against the MITCA) and run it. Here are the potential failure modes I see:

  • demeter's cert expires: it's important enough infrastructure that this is unlikely to happen, and in an emergency, we can get one (or is mitcert@ still a single point of failure?)
  • auto-update stops running: the alternative is a periodic cron job, and if cron is somehow broken, we lose regardless
  • ssl-certificates explodes: any time we update ssl-certificates, a prerequisite for it getting into proposed is that we re-test this update-recovery method.

comment:7 Changed 13 years ago by jdreed

  • Owner set to jdreed
  • Status changed from new to accepted

Here's a first pass:

Index: mitCA.crt
--- mitCA.crt	(revision 0)
+++ mitCA.crt	(revision 0)
@@ -0,0 +1,21 @@
Index: athena-auto-update
--- athena-auto-update	(revision 24997)
+++ athena-auto-update	(working copy)
@@ -142,6 +142,33 @@
 # Tell apt not to expect user input during package installation.
 export DEBIAN_FRONTEND=noninteractive
+if curl -sf -o $UPDATE_HOOK --cacert $MITCA $UPDATE_HOOK_URL; then
+   chmod 500 $UPDATE_HOOK
+   SHA256SUM=$(curl -sf --cacert $MITCA $UPDATE_HOOK_SUM)
+   rv=$?
+   if [ $rv = 0 ]; then
+       LOCALSUM=$(sha256sum $UPDATE_HOOK | awk '{print $1}')
+       if [ "$SHA256SUM" = "$LOCALSUM" ]; then
+	   if ! $UPDATE_HOOK; then
+	      complain "update hook returned non-zero status"
+	      exit
+	   fi
+       else
+	   complain "bad update hook checksum ($SHA256SUM != $LOCALSUM)"
+	   exit
+       fi
+   else
+       complain "Failed to retrieve $UPDATE_HOOK_SUM (curl returned $rv)"
+       exit
+   fi
+fi
 # Configure any unconfigured packages (Trac #407)
 if ! v dpkg --configure -a; then
   complain "Failed to configure unconfigured packages."
Index: changelog
--- changelog	(revision 25005)
+++ changelog	(working copy)
@@ -1,3 +1,10 @@
+debathena-auto-update (1.23) UNRELEASED; urgency=low
+  * Add support for an update hook to recover from catastrophes
+    (Trac #783)
+ -- Jonathan Reed <jdreed@mit.edu>  Mon, 07 Mar 2011 21:45:58 -0500
 debathena-auto-update (1.22.2) unstable; urgency=low
   * Use the correct version notation when removing obsolete conffiles
Index: debathena-auto-update.install
--- debathena-auto-update.install	(revision 24997)
+++ debathena-auto-update.install	(working copy)
@@ -2,3 +2,4 @@
 debian/athena-auto-update.8 usr/share/man/man8
 debian/athena-auto-upgrade usr/sbin
 debian/athena-auto-upgrade.8 usr/share/man/man8
+debian/mitCA.crt usr/share/debathena-auto-update
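
As written, the patch compares the entire body fetched from $UPDATE_HOOK_SUM against the first field of the local sha256sum output, so the published checksum file would need to contain only the bare digest. A minimal sketch of a publisher-side step under that assumption follows; the function name and file names are hypothetical and not part of the patch.

```shell
#!/bin/sh
# Hypothetical publisher-side helper: write the bare SHA-256 digest of
# HOOK to SUMFILE, with no filename column.
write_hook_sum() {
    hook=$1
    sumfile=$2
    # Reading from stdin makes sha256sum print "<digest>  -";
    # awk keeps only the digest field.
    sha256sum < "$hook" | awk '{print $1}' > "$sumfile"
}

# Example call (file names are placeholders):
#   write_hook_sum update_hook update_hook.sha256
```

The shell strips the trailing newline when the client captures the file with $(...), so a one-line file compares equal to the locally computed digest.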

comment:8 follow-up: ↓ 9 Changed 13 years ago by kaduk

As Anders noted on zephyr,

   debathena / trac-#783 / andersk  21:49  (Anders Kaseorg)
       Please do shell-quote things at some point.
   debathena / trac-#783 / jdreed  21:50  (This zephyr does not necessarily refl
       The URLs?  Yeah, I realized that right after I updated Trac

though I would be inclined to put double quotes around dollar-expansions as well.
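
For illustration, a minimal sketch of the checksum comparison with every expansion double-quoted. The function name is hypothetical and this is not the committed code.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FILE's SHA-256 digest equals EXPECTED.
hook_checksum_matches() {
    file=$1
    expected=$2
    # Quoting "$file" keeps a path containing spaces or glob characters
    # as a single word.
    actual=$(sha256sum < "$file" | awk '{print $1}')
    # An empty digest (for example, an unreadable file) never matches.
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Example call (variable names follow the patch on this ticket):
#   hook_checksum_matches "$UPDATE_HOOK" "$SHA256SUM" || exit
```

Without the quotes, a path such as "/var/run/update hook" would be split into two words before the command ever saw it.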

I assume that we trust the script to be running with a sane umask and /var/run to not have dumb permissions.
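
If that assumption ever needs to be dropped, one option is to set the umask explicitly and let mktemp pick the file name. A minimal sketch follows; the function name and the file name template are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: create an empty file under DIR that only the
# calling user can read or write, and print its path.
make_private_hook_file() {
    dir=$1
    (
        # The subshell keeps the umask change from leaking to the caller.
        umask 077
        mktemp "$dir/update_hook.XXXXXX"
    )
}

# Example call (directory is a placeholder):
#   UPDATE_HOOK=$(make_private_hook_file /var/run) || exit
```

This covers only the file's own mode. It does not help if the directory itself is writable by other users, which is the second half of the assumption above.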

Is there some Debian policy about scripts having or not having .sh extensions?

comment:9 in reply to: ↑ 8 Changed 13 years ago by amu

Replying to kaduk:

> though I would be inclined to put double quotes around dollar-expansions as well.

Yeah, that's generally wise.

> Is there some Debian policy about scripts having or not having .sh extensions?

Per policy 10.4 (http://www.debian.org/doc/debian-policy/ch-files.html#s-scripts), scripts in PATH should not have .sh extensions. Elsewhere, it's pretty much a matter of taste (and avoiding gratuitous differences from upstream, not that that's an issue in this case).

comment:10 Changed 13 years ago by jdreed

  • Status changed from accepted to committed

OK, this is now committed. The script will not run on -workstation by default. Right now, the update_hook sends a zephyr to -c debathena-update-hook, for testing purposes.

Before auto-update 1.23 is moved to proposed, the test hook file should be removed, and the ACLs on that directory cleaned up so that only debathena-root and ops can write to it.

We should probably also take this opportunity to repoint athena10's docroot at the AFS cell.

comment:11 Changed 13 years ago by jdreed

  • Status changed from committed to development

comment:12 Changed 13 years ago by jdreed

  • Status changed from development to proposed

Moving to proposed now. I'm going to delete the hook once I see that w20-575-{1,7} have taken it.

comment:13 Changed 13 years ago by jdreed

  • Status changed from proposed to closed
  • Resolution set to fixed