Ticket #746 (closed defect: fixed)

Opened 13 years ago

Last modified 13 years ago

zephyr subs disappear in under a minute on gaston.mit.edu

Reported by: kchen
Owned by:
Priority: normal
Milestone: The Distant Future
Component: --
Keywords:
Cc:
Fixed in version:
Upstream bug:

Description

I heard this happen on another machine before, but hadn't seen it myself until now. gaston.mit.edu was just reinstalled yesterday or today.

paco@GASTON:~$ zctl load ; while true ; do date ; zctl ret ; sleep 10 ; done
Fri Dec 3 18:00:42 EST 2010
Class sipb Instance * Recipient *
Class consult Instance * Recipient *
Class rcc Instance * Recipient *
Class help Instance * Recipient *
Class operations Instance message Recipient *
Class message Instance personal Recipient paco@…
Class message Instance urgent Recipient paco@…
Fri Dec 3 18:00:52 EST 2010
Class sipb Instance * Recipient *
Class consult Instance * Recipient *
Class rcc Instance * Recipient *
Class help Instance * Recipient *
Class operations Instance message Recipient *
Class message Instance personal Recipient paco@…
Class message Instance urgent Recipient paco@…
Fri Dec 3 18:01:02 EST 2010
Class sipb Instance * Recipient *
Class consult Instance * Recipient *
Class rcc Instance * Recipient *
Class help Instance * Recipient *
Class operations Instance message Recipient *
Class message Instance personal Recipient paco@…
Class message Instance urgent Recipient paco@…
Fri Dec 3 18:01:12 EST 2010
Class sipb Instance * Recipient *
Class consult Instance * Recipient *
Class rcc Instance * Recipient *
Class help Instance * Recipient *
Class operations Instance message Recipient *
Class message Instance personal Recipient paco@…
Class message Instance urgent Recipient paco@…
Fri Dec 3 18:01:22 EST 2010
Fri Dec 3 18:01:32 EST 2010

Change History

comment:1 Changed 13 years ago by jdreed

This is the "two zhms" problem:

GASTON:~%ps -ax | grep -i zhm
Warning: bad ps syntax, perhaps a bogus '-'? See http://procps.sf.net/faq.html
  863 ?        Ss     0:00 /usr/sbin/zhm -f
  864 ?        Ss     0:00 /usr/sbin/zhm -f
 2660 pts/5    S+     0:00 grep -i -i zhm

See also #133

comment:2 Changed 13 years ago by jdreed

  • Status changed from new to closed
  • Resolution set to duplicate

comment:3 Changed 13 years ago by geofft

  • Status changed from closed to reopened
  • Resolution duplicate deleted

Reopening because this is a different issue from #133: that ticket is "we should not have an initscript", and this one is "assuming we have an initscript, things should not go horribly wrong."

It turns out that

  1. Starting zhm sends an HM_BOOT to the server, which flushes all local subscriptions.
  2. /etc/init.d/zhm restart calls zhm with -N, which is undocumented in the man page but suppresses this behavior.
  3. Zephyr 3 patched ZOpenPort to set SO_REUSEADDR so that restarting zhm doesn't require waiting for the old one's port to time out.
  4. Linux (at least) lets you have two UDP sockets bound to the same port if SO_REUSEADDR is set on both.

and therefore, when we call "/etc/init.d/zhm start", we start another zhm, which binds successfully despite the existing zhm, sends an HM_BOOT, and you lose subs. Otherwise, nothing goes particularly horribly wrong when there are multiple zhms. So in r24910 we switch to "restart", which still starts another zhm but doesn't flush subs. Upstream will unset SO_REUSEADDR after binding, which will cause the second zhm to get EADDRINUSE and die. We still shouldn't be starting two zhms in the first place at all, but that's #133.
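
Roughly, the switch amounts to something like this in an if-up.d hook (the path, script name, and surrounding details here are assumptions for the sketch; only the start-to-restart change comes from r24910):

    #!/bin/sh
    # Hypothetical /etc/network/if-up.d/zephyr-clients hook.
    # Calling "restart" rather than "start" means zhm is invoked with -N,
    # so it does not send HM_BOOT and existing subscriptions survive.
    [ "$IFACE" = lo ] && exit 0
    if [ -x /etc/init.d/zhm ]; then
        invoke-rc.d zhm restart || true
    fi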

comment:4 Changed 13 years ago by geofft

  • Status changed from reopened to committed

comment:5 Changed 13 years ago by jdreed

Are we blocking on upstream, or can we push this out to development?

comment:6 Changed 13 years ago by jdreed

  • Status changed from committed to development

TTL expired

comment:7 follow-up: ↓ 12 Changed 13 years ago by kaduk

  • Status changed from development to new

Updating to zephyr-config 1.10 restarts "the" running zhm, which still leaves two running if there were two prior. I also still get two zhms after rebooting. Since this is a race condition at its core, I collected some data sets by rebooting a lot:

package state               # of zhms (per trial)
1.9.2 (three trials)        2 2 2
1.10                        2 1 2 1 3 2 2    (!!!)
1.10 + 5s sleep in ifup.d   2 2 2 1 2 2 2
1.10 + 10s sleep in ifup.d  1 1 1 1 1 1 1 1 1

Amazingly, I managed to get three zhms running at one point. I'm rather surprised that a five-second sleep was not enough to break the race, but ten seconds has done rather well in my testing. Whether we want to push a sleep hack to production or fix the state of the scripts is another question, though ...
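
For reference, the sleep hack amounts to something like this in the if-up.d hook (the script name and surrounding logic are guesses, not the actual zephyr-config contents):

    #!/bin/sh
    # Hypothetical if-up.d hook with the sleep hack from the table above:
    # delay the ifup-triggered restart so it doesn't race the boot-time
    # initscript start of zhm.
    [ "$IFACE" = lo ] && exit 0
    sleep 10    # 5s was not reliably enough in these trials; 10s was
    invoke-rc.d zhm restart || true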

comment:8 Changed 13 years ago by jdreed

We need to fix this. Can we just fix #133 at the same time and kill off the initscript entirely?

comment:9 follow-up: ↓ 10 Changed 13 years ago by kaduk

We should be able to do so if we put an 'exit 0' or similar in /etc/default/zephyr-clients under appropriate conditions.

This, however, would mean that we literally need to copy and paste all the logic from /etc/init.d/zhm into the ifup/down scripts, and /etc/init.d/zhm would become dead code.

We also need to be sure to not do dumb things when there are multiple active interfaces, per zephyr. jdreed mentioned:

       Sounds like we still need to implement flock or something for this.
       So maybe we should just do that and move on.
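
For concreteness, here's a sketch of what a flock(1)-based guard might look like (the lock file path is made up):

    # Hypothetical use of flock(1) to serialize zhm starts, so that
    # concurrent ifup events and the boot-time initscript can't race
    # each other into starting two zhms.
    (
        flock -n 9 || exit 0    # someone else is already (re)starting zhm
        invoke-rc.d zhm restart
    ) 9>/var/run/zhm-start.lock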

comment:10 in reply to: ↑ 9 ; follow-up: ↓ 11 Changed 13 years ago by andersk

Replying to kaduk:

This, however, would mean that we literally need to copy and paste all the logic from /etc/init.d/zhm into the ifup/down scripts, and /etc/init.d/zhm would become dead code.

Uh, no. The point of the /etc/default/zephyr-clients suggestion was that we can put the exit 0 inside an if statement that we control. (We write something that means “if we’re running from the ifup script”, or “if we aren’t starting because the system is booting”, or whatever.)
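
As a sketch of that idea (the variable name and condition here are invented, purely for illustration):

    # Hypothetical /etc/default/zephyr-clients (sourced by /etc/init.d/zhm):
    # make the initscript bail out unless we were invoked from the ifup
    # hook, which would set this (invented) variable.
    if [ "$ZHM_FROM_IFUPDOWN" != "yes" ]; then
        exit 0
    fi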

comment:11 in reply to: ↑ 10 Changed 13 years ago by kaduk

Replying to andersk:

Replying to kaduk:

This, however, would mean that we literally need to copy and paste all the logic from /etc/init.d/zhm into the ifup/down scripts, and /etc/init.d/zhm would become dead code.

Uh, no. The point of the /etc/default/zephyr-clients suggestion was that we can put the exit 0 inside an if statement that we control. (We write something that means “if we’re running from the ifup script”, or “if we aren’t starting because the system is booting”, or whatever.)

Anders, I am aware of that methodology.
However, in the context of #133, where we want zhm to not use an init script at all and to be controlled entirely by the interface state change scripts, I do not see the value in having the ifup/down scripts call into an init script that serves solely as a library of common routines. A library of routines shared by up/down may (or may not) be useful, but I do not think /etc/init.d/ is a good place for such a library. (Something like, say, /usr/lib/zephyr-clients/ seems better.) Thus far I have not seen you address the initscript-free-zhm case directly, and I would like to hear your thoughts on it, since that seems to be the direction we are heading.
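
To make the /usr/lib/zephyr-clients/ idea concrete, here's roughly what I have in mind (paths, function names, and variables are invented for the sketch):

    # Hypothetical /usr/lib/zephyr-clients/common.sh: routines shared by
    # the if-up.d and if-down.d scripts, so /etc/init.d/ isn't abused as
    # a library.
    start_zhm() {
        [ -r /etc/default/zephyr-clients ] && . /etc/default/zephyr-clients
        start-stop-daemon --start --quiet --exec /usr/sbin/zhm -- $ZHM_ARGS
    }
    stop_zhm() {
        start-stop-daemon --stop --quiet --exec /usr/sbin/zhm
    }
    # The ifup/down hooks would then just do:
    #   . /usr/lib/zephyr-clients/common.sh
    #   stop_zhm; start_zhm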

comment:12 in reply to: ↑ 7 Changed 13 years ago by geofft

Replying to kaduk:

Updating to zephyr-config 1.10 restarts "the" running zhm, which still leaves two running if there were two prior. I also still get two zhms after rebooting.

Yes, that's intentional -- zephyr-config 1.10 makes the two zhm case not cause you to lose subs, but doesn't actually solve the issue that there are two zhms. That's #133, which is still open.

comment:13 Changed 13 years ago by jdreed

  • Status changed from new to development

OK, we think 1.10 works in that it will not stop you from getting two zhms, but it will stop subsequent zhms from killing your subs. #133 is still a separate ticket. So we should push 1.10 to proposed, I think, and I'll do so on Monday unless someone convinces me not to.

comment:14 Changed 13 years ago by jdreed

  • Status changed from development to proposed

comment:15 follow-up: ↓ 16 Changed 13 years ago by cfox

red-herring.mit.edu took the zephyr-config update on Monday, without incident. Today (Wednesday) zephyr was working fine between 11:00 and 15:00, and I had subs until 22:04.

At 22:0?, I tried to send a zephyrgram, and got an error to the effect that there was no zhm. I checked with ps, and indeed it was so. I did a clumsy job of restarting it (first restarted it as cfox, then killed that one and restarted as root) and when it was running, I no longer had subs. (So I lost subs at some point between 22:04 and 22:15.)

comment:16 in reply to: ↑ 15 Changed 13 years ago by geofft

Replying to cfox:

red-herring.mit.edu took the zephyr-config update on Monday, without incident. Today (Wednesday) zephyr was working fine between 11:00 and 15:00, and I had subs until 22:04.

At 22:0?, I tried to send a zephyrgram, and got an error to the effect that there was no zhm. I checked with ps, and indeed it was so. I did a clumsy job of restarting it (first restarted it as cfox, then killed that one and restarted as root) and when it was running, I no longer had subs. (So I lost subs at some point between 22:04 and 22:15.)

Is there anything in /var/log/daemon.log? (Everything else I can think of, we checked over zephyr, but zhm's normal/abnormal exit messages should be listed there.)

There's nothing in debathena-zephyr-config that ever kills zhms, and certainly nothing in the 1.10 change to that effect, and if zhm exits gracefully it should syslog as much; so unless you find a syslog message in daemon.log, I have no idea what could have happened, short of a memory leak.
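
For instance, something like this should turn up any exit messages (assuming zhm syslogs under the 'hm' tag):

    grep ' hm\[' /var/log/daemon.log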

I'm considering putting a wrapper around zhm to track its starts (and why) and stops (and why) and log all of that -- I ended up with two zhms on my Squeeze server last week, without any network state change and without having taken 1.10. I think our most recent plan to fix #133 (the actual two-zhm bug) involves a wrapper script anyway. Once that's in place, I think we'll have more of an idea of what to do if you can ever reproduce this.
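
A rough sketch of such a wrapper (the name, log tag, and details are placeholders):

    #!/bin/sh
    # Hypothetical zhm wrapper: record who started zhm and when, and how
    # it exited.  Assumes zhm runs in the foreground (-f), as in the ps
    # output earlier in this ticket, so the wrapper outlives it.
    logger -t zhm-wrapper "starting zhm ($*), parent pid $PPID ($(ps -o comm= -p $PPID))"
    /usr/sbin/zhm "$@"
    status=$?
    logger -t zhm-wrapper "zhm exited with status $status"
    exit $status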

comment:17 Changed 13 years ago by cfox

Ah, I think we can blame the network blip that occurred at 18:48:

Jun 29 18:48:23 red-herring NetworkManager: <info> (eth0): writing resolv.conf to /sbin/resolvconf
Jun 29 18:48:23 red-herring NetworkManager: <info> (eth0): device state change: 7 -> 8 (reason 0)
Jun 29 18:48:23 red-herring NetworkManager: <info> (eth0): writing resolv.conf to /sbin/resolvconf
Jun 29 18:48:23 red-herring NetworkManager: <info> Policy set 'Auto eth0' (eth0) as default for routing and DNS.
Jun 29 18:48:23 red-herring NetworkManager: <info> Activation (eth0) successful, device activated.
Jun 29 18:48:23 red-herring NetworkManager: <info> Activation (eth0) Stage 5 of 5 (IP Configure Commit) complete.
Jun 29 18:48:23 red-herring hm[23627]: Terminate signal caught...

Jun 29 22:14:40 red-herring hm[11076]: Terminate signal caught...

(The second one is from my clumsy attempt to restart it, which involved killing the one started as cfox.)

So the only puzzlement is why I had subscriptions between 18:48 and 22:04 - if I didn't lose them over the network blip, why did a later zhm start/restart harsh them?

comment:18 Changed 13 years ago by jdreed

Deployed 6/29. This can get closed if we're satisfied that the behavior Camilla saw was a fluke.

comment:19 Changed 13 years ago by jdreed

  • Status changed from proposed to closed
  • Resolution set to fixed