Ticket #873 (closed defect: fixed)

Opened 10 years ago

Last modified 9 years ago

Our build infrastructure is not reliable

Reported by: jdreed
Owned by:
Priority: blocker
Milestone: Precise Alpha
Component: --
Keywords:
Cc:
Fixed in version:
Upstream bug:

Description (last modified by geofft)

zulu keeps exploding. Today, it exploded for the 3rd time in the "Hack binNMU version" stage of the build, though I have no idea why.

This happened while attempting to do a time-sensitive build, which is very not good.

Change History

comment:1 Changed 10 years ago by andersk

Did you see my reply last week?

debathena / zulu / jdreed  2011-04-11 22:19  (This zephyr does not necessarily reflect the views of IS&T, MIT, its)
    This is the 3rd time it's happened in the "Hack binNMU version" stage
debathena / zulu / andersk  2011-04-12 00:36  (Anders Kaseorg)
    > This is the 3rd time it's happened in the "Hack binNMU version" stage

    That’s … actually kinda funny, because it’s exactly the section that
    I patched in 0.60.1-1andersk1.  I can’t imagine what I did to cause
    kernel panics.
debathena / zulu / andersk  2011-04-12 00:37  (Anders Kaseorg)
    The patch is what I posted to
    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=620112

comment:2 Changed 10 years ago by jdreed

  • Priority changed from blocker to high
  • Summary changed from Our build infrastructure is not reliable to Migrate zulu to not-Debian
  • Type changed from defect to enhancement
  • Description modified
  • Milestone changed from Natty Beta to Fall 2011

comment:3 Changed 10 years ago by geofft

  • Priority changed from high to blocker
  • Summary changed from Migrate zulu to not-Debian to Our build infrastructure is not reliable
  • Type changed from enhancement to defect
  • Description modified
  • Milestone changed from Fall 2011 to Natty Beta

This ticket has nothing to do with the OS; in fact, Lucid, being an older OS with an older kernel, is probably more likely than Squeeze to randomly panic on the codepaths we call.

Paths forward include debugging aufs, moving away from unioning file systems altogether (for example, unpacking chroots from tarballs), or replacing the aufs kernel patch with union mounts, overlayfs, unionfs_fuse, or something similar. (But this ticket is separate from #463; we're interested in a different union solution in the clusters as well, and it may well be that a different solution like tarballs is the right answer for the builder and not for clusters.)

comment:4 Changed 9 years ago by geofft

r25470 switches make-chroot to using tar-based chroots. This takes a bit of time to unpack (anywhere from 15 seconds to over a minute depending on how warm the cache is and how contended the operation is), but is otherwise perfectly reliable because it's entirely in userspace, and also uses a bunch less disk space as a side benefit. I've converted the precise chroot to one of these, and it's worked smoothly for the precise build.
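For anyone unfamiliar with the approach, here is a rough sketch of what a tar-based chroot session amounts to: unpack the tarball into a fresh scratch directory, use it, and throw it away afterwards. This is not the code from r25470 or from schroot; the function name `unpacked_chroot` and the `scratch_root` parameter are made up for illustration, and a real chroot would need to be unpacked as root to preserve ownership and device nodes.

```python
import contextlib
import shutil
import tarfile
import tempfile
from pathlib import Path


@contextlib.contextmanager
def unpacked_chroot(tarball, scratch_root=None):
    """Unpack `tarball` into a fresh scratch directory; delete it on exit.

    Hypothetical helper: `scratch_root` is an assumed parameter naming the
    directory that holds per-session trees (default: the system temp dir).
    """
    session_dir = Path(tempfile.mkdtemp(prefix="chroot-", dir=scratch_root))
    try:
        # "r:*" auto-detects plain, gzip, or bzip2 archives, so the same
        # code handles both compressed and uncompressed tarballs.
        with tarfile.open(tarball, "r:*") as archive:
            archive.extractall(session_dir)
        yield session_dir
    finally:
        # Everything happens in userspace: cleanup is just removing a tree.
        shutil.rmtree(session_dir, ignore_errors=True)
```

Usage would look like `with unpacked_chroot("precise.tar") as root: ...`, where the tarball name is again only a placeholder. The unpack step is where the startup delay mentioned above comes from, and no kernel-level union mount is involved at any point.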

We should convert the remaining chroots on zulu, and figure out what to do with the dink volume group. (I think we can empty it of everything but schroot-scratch, so it's possible that vg should move onto the zulu volume group.) Alternatively, we could wipe and reinstall the build server as proposed in #940.

comment:5 Changed 9 years ago by jdreed

  • Status changed from new to closed
  • Resolution set to fixed

OK, after 2 weeks of tar-based chroots, things look good. I've gone ahead and un-bzipped them, because the disk space increase is minimal and the performance benefit is high. The legacy schroot entries ("-classic") and the LVs are still there, and should remain there for a bit longer, because disk is cheap and paranoia is good.
