Ticket #1020 (closed defect: fixed)

Opened 10 years ago

Last modified 9 years ago

aptitude sometimes spins forever when in --download-only mode

Reported by: jdreed
Owned by:
Priority: high
Milestone: Upstream Utopia
Component: --
Keywords:
Cc:
Fixed in version:
Upstream bug: LP:975793 DebianBug:629266

Description (last modified by jdreed)

It may be relevant that when the problem does occur, it's always in the second invocation, when there aren't actually any files to download.

Change History

comment:1 Changed 10 years ago by jdreed

  • Priority changed from normal to high

auto-update is now wedged on granola in a similar state.

In each case, it fails inside "aptitude --quiet --assume-yes --download-only dist-upgrade" at
Writing extended state information....

In this case, it merely wants to upgrade gdm-config, which shouldn't be a hard transaction. This possibly points to an internal error in aptitude, especially since we're just asking it to download, which is not a hard operation.

See /mit/jdreed/Public/granola-update.log for what it looks like right now.

comment:2 Changed 10 years ago by jdreed

I forgot that granola is running sshd. Backtracing the wedged aptitude, which is

5661 ? Sl 64:06 aptitude --quiet --assume-yes --download-only dist-upgrade

#0  0x00007fbe3d2d981d in __libc_waitpid (pid=<value optimized out>, 
    stat_loc=<value optimized out>, options=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/waitpid.c:41
#1  0x00007fbe3e7b9c63 in ExecWait(int, char const*, bool) ()
   from /usr/lib/libapt-pkg.so.4.10
#2  0x00007fbe3e83c8be in pkgDPkgPM::RunScriptsWithPkgs(char const*) ()
   from /usr/lib/libapt-pkg.so.4.10
#3  0x00007fbe3e844b05 in pkgDPkgPM::Go(int) ()
   from /usr/lib/libapt-pkg.so.4.10
#4  0x00007fbe3e7d6f85 in pkgPackageManager::DoInstallPostFork(int) ()
   from /usr/lib/libapt-pkg.so.4.10

comment:3 Changed 10 years ago by jdreed

Er, sorry, there's also:

31328 ?        S      0:00 /bin/sh -c /usr/sbin/dpkg-preconfigure --apt || true
31329 ?        R      0:00 /usr/bin/perl -w /usr/sbin/dpkg-preconfigure --apt

Looks like dpkg-preconfigure is being repeatedly called and failing. Over the past minute, I've seen at least 10 processes similar to the ones above. There's only ever one set in ps output, but they're appearing, terminating, and respawning, AFAICT. They do so fast enough that I can't even attach gdb in time. Anyone debugging should run "ps auxww" several times in a row, grepping for dpkg, and you'll see them.
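A rough sketch of how the respawning could be quantified rather than eyeballed: sample the process table several times and count how many distinct PIDs show up. The function names, sample count, and delay below are all hypothetical and not part of auto-update.

```shell
#!/bin/bash
# Count the distinct PIDs (first field) in ps-style lines read from stdin.
count_distinct_pids() {
    awk '{ print $1 }' | sort -un | wc -l
}

# Sample the process table repeatedly and print the matching lines.
# The defaults (20 samples, 0.2s apart) are arbitrary assumptions.
sample_preconfigure() {
    local samples=${1:-20} delay=${2:-0.2} i
    for ((i = 0; i < samples; i++)); do
        # The [d] bracket keeps grep from matching its own command line.
        ps -eo pid=,args= | grep '[d]pkg-preconfigure --apt'
        sleep "$delay"
    done
}

# Usage: sample_preconfigure | count_distinct_pids
```

A count that keeps growing across runs would suggest that new processes are being spawned, rather than one long-lived process being seen repeatedly.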

comment:4 Changed 10 years ago by jdreed

  • Summary changed from cron job to ensure auto-update doesn't get wedged to auto-update sits at "Writing extended state info"

comment:5 Changed 10 years ago by jdreed

  • Priority changed from high to blocker
  • Milestone changed from Fall 2011 to Natty Release

This is actually a release blocker, since the machines can't be fixed without intervention, and neither I nor hotline will be visiting every single cluster machine again. If we don't have a solution tomorrow, I propose we push out the release anyway, with the following addition to auto-update that gets dropped into cron.hourly:

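#!/bin/bash
# Kill auto-update if /var/run/athena-nologin is more than an hour old.

# Modification time in epoch seconds; empty if the file does not exist.
UPD_START=$(stat -c "%Y" /var/run/athena-nologin 2>/dev/null)
[ -z "$UPD_START" ] && exit 0
NOW=$(date +"%s")
ELAPSED=$((NOW - UPD_START))
if [ "$ELAPSED" -gt 3600 ]; then
   pkill -f athena-auto-update
   # (or maybe just reboot?)
fi
exit 0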

comment:6 Changed 10 years ago by jdreed

Er, maybe add

[ "$(machtype -L)" = "debathena-cluster" ] || exit 0

at the top there, depending on whether we pkill or reboot. (Or maybe regardless?)
I tested killing the proc on w20-575-2 when it was wedged, and rebooting is fine, since the "aptitude install" stage of auto-update will get things going again on the next invocation.

comment:7 Changed 10 years ago by jdreed

  • Owner set to jdreed
  • Status changed from new to accepted

Geoff identified the code that breaks, but we still don't know why it gets called.

A horrible hack was committed and pushed out in auto-update 1.31.

comment:8 Changed 10 years ago by jdreed

Fixed less stupidly and more functionally in auto-update 1.32, which just got pushed out. Keeping this open until we have a fix for the actual bug. Geoff notes that this is DebianBug:629266, and I concur.

comment:9 Changed 10 years ago by jdreed

  • Priority changed from blocker to high
  • Summary changed from auto-update sits at "Writing extended state info" to aptitude sometimes spins forever when in --download-only mode
  • Description modified (diff)
  • Milestone changed from Natty Release to Fall 2011

comment:10 Changed 10 years ago by jdreed

The upstream bug appears to be going nowhere fast, and every time I try to debug this problem, I can't reproduce it. We should probably focus our efforts on a non-crappy version of athena-auto-update or something. Or consider using apt-get to do the downloading, since we really only care about aptitude for its dependency resolver. Or will that be harder?

comment:11 Changed 9 years ago by jdreed

This apparently got fixed on May 5, but we may not see it until Quantal?

comment:12 Changed 9 years ago by jdreed

I seem to be encountering this more on my Precise VM. Do we want to continue sucking it up, or try and get this SRU'd to Precise, or what?

comment:13 Changed 9 years ago by jdreed

  • Upstream bug set to LP:975793 Debian:629266

comment:14 Changed 9 years ago by jdreed

AFAICT, I can eliminate the problem by commenting out the only line in /etc/apt/apt.conf.d/70debconf, which wants to run dpkg-preconfigure. Is it reasonable to do that during an auto-update? Certainly it's less klunky than our timeout(1) solution.
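For comparison, a minimal sketch of what a timeout(1)-based wrapper might look like. This is not the actual auto-update code; the function name and the one-hour limit in the usage line are assumptions. The only behavior relied on is that coreutils timeout exits with status 124 when the deadline expires.

```shell
#!/bin/bash
# Run a command under a deadline using coreutils timeout(1).
# timeout exits with status 124 when the deadline expires.
run_with_deadline() {
    local limit=$1
    shift
    timeout "$limit" "$@"
    local rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}: $*" >&2
    fi
    return "$rc"
}

# Hypothetical usage; the one-hour limit is an assumption:
# run_with_deadline 1h aptitude --quiet --assume-yes --download-only dist-upgrade
```

The caller still has to decide what to do on status 124 (retry, skip the download step, or give up), which is part of why this approach feels clunky next to removing the failing hook.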

comment:15 Changed 9 years ago by jdreed

  • Status changed from accepted to committed

So, I encountered a borked auto-update, and ln -nsf'd dpkg-preconfigure to /bin/true, and auto-update picked up and continued on normally. This implies it's either more subtle than the original upstream bug, or there were two bugs. I've gone ahead and inhibited pre-configuring during auto-update -- since it's unattended, it's pointless anyway.

comment:16 Changed 9 years ago by jdreed

Nope, that doesn't fix it. Apparently aptitude is just broken. "yay"

comment:17 Changed 9 years ago by jdreed

Actually, that did make it continue far enough to get to the post-invoke scripts, where it also failed. So let's just disable everything in download mode and see what happens, because why not.

comment:18 Changed 9 years ago by jdreed

  • Owner jdreed deleted
  • Status changed from committed to new

comment:19 Changed 9 years ago by jdreed

Nope, aptitude is still sitting in DoInstallPostFork, despite the fact that there's nothing to do.

comment:20 Changed 9 years ago by jdreed

And the borked version is still in Quantal. Someone should get upstream to take 0.6.7 into Quantal. Or we can wait until April 2013, whatever.

comment:21 Changed 9 years ago by jdreed

  • Milestone changed from Precise Release to Upstream Utopia

comment:22 Changed 9 years ago by jdreed

  • Upstream bug changed from LP:975793 Debian:629266 to LP:975793 DebianBug:629266

comment:23 Changed 9 years ago by jdreed

  • Status changed from new to closed
  • Resolution set to fixed

According to the LP bug, Quantal took the new version. I see no reason to switch back to aptitude for auto-update/install, however.
