Ticket #1020 (accepted defect) — at Version 9

Opened 13 years ago

Last modified 12 years ago

aptitude sometimes spins forever when in --download-only mode

Reported by: jdreed Owned by: jdreed
Priority: high Milestone: Upstream Utopia
Component: -- Keywords:
Cc: Fixed in version:
Upstream bug:  LP:975793  DebianBug:629266

Description (last modified by jdreed) (diff)

It may be relevant that when the problem does occur, it's always in the second invocation, when there aren't actually any files to download.

Change History

comment:1 Changed 13 years ago by jdreed

  • Priority changed from normal to high

auto-update is now wedged on granola in a similar state.

In each case, it fails inside "aptitude --quiet --assume-yes --download-only dist-upgrade" at "Writing extended state information...".

In this case, it merely wants to upgrade gdm-config, which shouldn't be a hard transaction. This possibly points to an internal error in aptitude, especially since we're just asking it to download, which is not a hard operation.

See /mit/jdreed/Public/granola-update.log for what it looks like right now.

comment:2 Changed 13 years ago by jdreed

I forgot that granola is running sshd. Backtracing the wedged aptitude, which is

5661 ? Sl 64:06 aptitude --quiet --assume-yes --download-only dist-upgrade

#0  0x00007fbe3d2d981d in __libc_waitpid (pid=<value optimized out>, 
    stat_loc=<value optimized out>, options=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/waitpid.c:41
#1  0x00007fbe3e7b9c63 in ExecWait(int, char const*, bool) ()
   from /usr/lib/libapt-pkg.so.4.10
#2  0x00007fbe3e83c8be in pkgDPkgPM::RunScriptsWithPkgs(char const*) ()
   from /usr/lib/libapt-pkg.so.4.10
#3  0x00007fbe3e844b05 in pkgDPkgPM::Go(int) ()
   from /usr/lib/libapt-pkg.so.4.10
#4  0x00007fbe3e7d6f85 in pkgPackageManager::DoInstallPostFork(int) ()
   from /usr/lib/libapt-pkg.so.4.10

comment:3 Changed 13 years ago by jdreed

Er, sorry, there's also:

31328 ?        S      0:00 /bin/sh -c /usr/sbin/dpkg-preconfigure --apt || true
31329 ?        R      0:00 /usr/bin/perl -w /usr/sbin/dpkg-preconfigure --apt

Looks like dpkg-preconfigure is being repeatedly called and failing. Over the past minute, I've seen at least 10 processes similar to the ones above. There's only ever one set in ps output, but they're appearing, terminating, and respawning, AFAICT. They do so fast enough that I can't even attach gdb in time. Anyone debugging should run "ps auxww" a few times, grepping for dpkg, and you'll see them.
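The repeated "ps auxww" sampling described above could be scripted roughly as follows. This is only a sketch: the function name sample_procs, its arguments, and the sample counts are hypothetical and not part of auto-update. Only "ps auxww" and grepping for dpkg come from this comment.

```shell
#!/bin/bash
# Sketch only: sample the process table repeatedly to catch short-lived
# processes. sample_procs and its arguments are hypothetical names.
#   $1 = grep pattern, $2 = number of samples, $3 = delay between samples
sample_procs() {
    local pattern=$1 samples=${2:-50} interval=${3:-0.1}
    local i
    for (( i = 0; i < samples; i++ )); do
        # grep exits non-zero when nothing matches; that is not an error here.
        ps auxww | grep -- "$pattern" || true
        sleep "$interval"
    done | sort -u
}
```

For example, sample_procs '[d]pkg' 100 0.05 takes 100 samples 50 ms apart. Writing the pattern as '[d]pkg' rather than 'dpkg' keeps grep from matching its own command line. Fractional sleep intervals assume GNU sleep.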

comment:4 Changed 13 years ago by jdreed

  • Summary changed from cron job to ensure auto-update doesn't get wedged to auto-update sits at "Writing extended state info"

comment:5 Changed 13 years ago by jdreed

  • Priority changed from high to blocker
  • Milestone changed from Fall 2011 to Natty Release

This is actually a release blocker, since the machines can't be fixed without intervention, and neither I nor hotline will be visiting every single cluster machine again. If we don't have a solution tomorrow, I propose we push out the release anyway, with the following addition to auto-update that gets dropped into cron.hourly:

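#!/bin/bash
# Kill auto-update if the nologin file is more than an hour old.

# Modification time of the nologin file, in seconds since the epoch
UPD_START=$(stat -c "%Y" /var/run/athena-nologin 2>/dev/null)
# No nologin file: nothing to do
[ -z "$UPD_START" ] && exit 0
NOW=$(date +"%s")
ELAPSED=$(( NOW - UPD_START ))
if [ "$ELAPSED" -gt 3600 ]; then
   pkill -f athena-auto-update
   # (or maybe just reboot?)
fi
exit 0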

comment:6 Changed 13 years ago by jdreed

Er, maybe add

[ "$(machtype -L)" = "debathena-cluster" ] || exit 0

at the top there, depending on whether we pkill or reboot. (Or maybe regardless?)
I tested killing the proc on w20-575-2 when it was wedged, and rebooting is fine, since the "aptitude install" stage of auto-update will get things going again on the next invocation.
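One way the machtype guard and the comment:5 script might fit together is sketched below. The function names update_is_stale and watchdog are hypothetical; the nologin path, the one-hour limit, the machtype check, and the pkill command are taken from the comments above. This is not the code that shipped in auto-update.

```shell
#!/bin/bash
# Sketch only: the comment:5 watchdog plus the cluster-only guard.
# Function names are hypothetical; paths and commands come from the ticket.

# Succeed if more than $3 seconds (default 3600) separate start ($1) and now ($2).
update_is_stale() {
    local start=$1 now=$2 limit=${3:-3600}
    [ $(( now - start )) -gt "$limit" ]
}

watchdog() {
    # Only act on cluster machines.
    [ "$(machtype -L)" = "debathena-cluster" ] || return 0
    local start
    # As in the original script, a missing nologin file means nothing to do.
    start=$(stat -c "%Y" /var/run/athena-nologin 2>/dev/null) || return 0
    [ -n "$start" ] || return 0
    if update_is_stale "$start" "$(date +%s)"; then
        pkill -f athena-auto-update
    fi
    return 0
}
```

A cron.hourly script built this way would end with a call to watchdog. Keeping the elapsed-time comparison in its own function lets the threshold be checked without touching the real nologin file.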

comment:7 Changed 13 years ago by jdreed

  • Owner set to jdreed
  • Status changed from new to accepted

Geoff identified the code that breaks, but we still don't know why it gets called.

A horrible hack was committed and pushed out in auto-update 1.31.

comment:8 Changed 13 years ago by jdreed

Fixed less stupidly and more functionally in auto-update 1.32, which just got pushed out. Keeping this open until we have a fix for the actual bug. Geoff notes that this is DebianBug:629266, and I concur.

comment:9 Changed 13 years ago by jdreed

  • Priority changed from blocker to high
  • Summary changed from auto-update sits at "Writing extended state info" to aptitude sometimes spins forever when in --download-only mode
  • Description modified (diff)
  • Milestone changed from Natty Release to Fall 2011