Plague of (ntpdate) zombies

talister
Joined: 17 May 08
Posts: 6
Credit: 50,699
RAC: 0
Message 1006 - Posted: 16 Feb 2009, 17:12:33 UTC

Since I received my sensor it has been running happily, but I have now discovered 106 ntpdate processes in zombie state on my machine (Fedora 10). I already run ntpd as a matter of course and it's well synchronized; are the two interacting in some bad way?

Carl Christensen
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Jan 08
Posts: 1039
Credit: 11,336
RAC: 0
Message 1007 - Posted: 16 Feb 2009, 18:44:00 UTC - in response to Message 1006.

Hmm, I'll take a look. It seems your computer never time-syncs with our server:

http://qcn.stanford.edu/qcnalpha/result.php?resultid=293819

Still, it shouldn't leave these zombie processes around, so I'll have to check the cleanup on my thread. It is really just an "ntpdate" exec, so it shouldn't conflict with your own ntpd, but something is obviously wrong.

cbahimsa
Joined: 28 Jan 09
Posts: 7
Credit: 46,535
RAC: 0
Message 1250 - Posted: 21 Mar 2009, 4:00:17 UTC - in response to Message 1007.

Mine has been doing that too, ever since (exactly since) I started running QCN. I am running the latest Jaunty 64-bit alpha, but it was doing it before that as well. When I reboot, the zombies are all gone. Then they return, more and more the longer I stay on without rebooting.

Rudy Toody
Joined: 22 Mar 09
Posts: 8
Credit: 7,925
RAC: 0
Message 1401 - Posted: 12 Apr 2009, 8:02:43 UTC - in response to Message 1007.
Last modified: 12 Apr 2009, 8:14:58 UTC

The zombies are caused by the parent process not calling wait() to acknowledge the termination of the child process.
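
To make the mechanism concrete, here is a minimal sketch in Python (an illustration only, not the QCN client's code, which does not appear in this thread). The helper names `proc_state` and `spawn_and_reap` are made up for the example, and the state check assumes a Linux system with /proc mounted. A child that has exited stays in the process table as a zombie until the parent collects its exit status with waitpid().

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if it is absent (Linux only)."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except OSError:
        return None
    # Layout is "pid (comm) state ..."; split after the last ')' because comm may contain spaces.
    return stat.rsplit(")", 1)[1].split()[0]


def spawn_and_reap():
    """Fork a child that exits at once, observe its state, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: terminate immediately

    # Parent: poll briefly until the exited child shows up as a zombie ('Z').
    state_before = proc_state(pid)
    for _ in range(100):
        if state_before in ("Z", None):
            break
        time.sleep(0.01)
        state_before = proc_state(pid)

    # waitpid() collects the exit status; this is what removes the zombie entry.
    _, status = os.waitpid(pid, 0)
    state_after = proc_state(pid)
    return state_before, os.WEXITSTATUS(status), state_after
```

On Linux, `state_before` should be `Z` and `state_after` should be `None`, since the process table entry disappears once the parent has waited. A parent that spawns a helper every few minutes and never waits would accumulate one such entry per run.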

ntpdate is being phased out, to be replaced by ntpd -q.

However, I have determined that if ntpd is already running or sleeping, invoking ntpd -q doesn't execute. I am still researching, but I would like the QCN process to be able to ask my normally installed ntpd for the info it needs. That way, I would have only one copy running. I would need to know which time server you prefer.

Read the ntp-doc package, which you can install on Linux, to find all the answers.

Edit: ntpdate has some conflicts with ntpd and does not set the clock when both are running.

Carl Christensen
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Jan 08
Posts: 1039
Credit: 11,336
RAC: 0
Message 1418 - Posted: 17 Apr 2009, 14:29:09 UTC - in response to Message 1401.

Thanks, Fred (Rudy Toody). I think, as you noted, my command line is being truncated, so I will look into a fix for that. I only run ntpdate to get a "local" offset from our servers, so it isn't really trying to update your clock; it just gives me a numeric offset I can report to adjust event times.

talister
Joined: 17 May 08
Posts: 6
Credit: 50,699
RAC: 0
Message 1419 - Posted: 17 Apr 2009, 16:32:00 UTC - in response to Message 1418.

I can confirm this. I have ntpd running as a matter of course. Attempting to run ntpdate against a local stratum 2 server produces:
# ntpdate ntp1.ucsb.edu
17 Apr 09:28:19 ntpdate[20417]: the NTP socket is in use, exiting
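
The "socket is in use" message is consistent with two processes wanting the same UDP port: NTP uses port 123, and a running ntpd already holds it. As a rough illustration of that general behaviour (not of ntpdate's internals), the Python sketch below binds one UDP socket to an arbitrary free loopback port and shows that a second bind to the same address and port is refused. The function name `second_bind_is_refused` is made up for the example, and an ephemeral port stands in for 123 so that no root privileges are needed.

```python
import errno
import socket


def second_bind_is_refused(host="127.0.0.1"):
    """Bind two UDP sockets to the same address and port; report whether the second is refused."""
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        first.bind((host, 0))  # port 0 lets the OS pick a free port
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same address and port as the first socket
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        second.close()
        first.close()
```

The ntpdate man page describes `-u` as using an unprivileged port for outgoing packets and `-q` as query-only, which is why an invocation with those flags can be expected to coexist with a running ntpd.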

Carl Christensen
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Jan 08
Posts: 1039
Credit: 11,336
RAC: 0
Message 1420 - Posted: 18 Apr 2009, 16:20:00 UTC - in response to Message 1419.

Can you try ntpdate with all the options I use, with either the copy I distribute via BOINC or your "local" ntpdate, i.e.

/var/lib/boinc-client/projects/qcn.stanford.edu_qcnalpha/ntpdate_4.2.4p5_i686-pc-linux-gnu -p 8 -t 20 -u -b -q qcn-upl.stanford.edu

or

ntpdate -p 8 -t 20 -u -b -q qcn-upl.stanford.edu


Run this way, it supposedly shouldn't conflict with a currently running ntpd (that's the -q flag); it just returns an informative message with the time offset between your machine and our server.

So what "Rudy Toody" discovered is that my command line above seems to be getting truncated, which would cause a conflict, so my ntpdate process dies and "zombies".

talister
Joined: 17 May 08
Posts: 6
Credit: 50,699
RAC: 0
Message 1432 - Posted: 22 Apr 2009, 14:17:56 UTC - in response to Message 1420.

Just to prove me wrong, both of those work fine:
~/tarfiles/incoming/BOINC/projects/qcn.stanford.edu_qcnalpha/ntpdate_4.2.4p5_i686-pc-linux-gnu -p 8 -t 20 -u -b -q qcn-upl.stanford.edu
server 171.64.173.104, stratum 3, offset -0.012550, delay 0.04921
22 Apr 06:34:09 ntpdate_4.2.4p5_i686-pc-linux-gnu[18910]: step time server 171.64.173.104 offset -0.012550 sec

/usr/sbin/ntpdate -p 8 -t 20 -u -b -q qcn-upl.stanford.edu
server 171.64.173.104, stratum 3, offset -0.013116, delay 0.04881
22 Apr 06:34:26 ntpdate[18913]: step time server 171.64.173.104 offset -0.013116 sec

Could it be a command-line length issue? I make the full pathname of the first command 144 chars, which doesn't seem overly long though.

Carl Christensen
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Jan 08
Posts: 1039
Credit: 11,336
RAC: 0
Message 1435 - Posted: 22 Apr 2009, 18:14:16 UTC - in response to Message 1432.
Last modified: 22 Apr 2009, 22:36:33 UTC

Great, that means my Linux command line is somehow getting truncated, as Rudy Toody thought, because the -q option shouldn't conflict with anybody's ntpd (if running). I just "grep" the offset to get a better estimate of the time.
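
As a sketch of what "grepping the offset" could look like, the Python snippet below pulls the offset out of the `server ...` line that `ntpdate -q` prints. The sample line is taken from talister's output in Message 1432. The function name `parse_offset` and the regular expression are assumptions for illustration, not the QCN client's actual parsing code.

```python
import re

# Matches "offset -0.012550" and captures the signed decimal number.
OFFSET_RE = re.compile(r"offset\s+([-+]?\d+(?:\.\d+)?)")


def parse_offset(output):
    """Return the offset in seconds from the first 'server ...' line, or None if there is none."""
    for line in output.splitlines():
        if line.startswith("server "):
            match = OFFSET_RE.search(line)
            if match:
                return float(match.group(1))
    return None


sample = "server 171.64.173.104, stratum 3, offset -0.012550, delay 0.04921"
print(parse_offset(sample))
```

Returning `None` when no `server` line is present lets the caller tell a failed run (for example the "NTP socket is in use" message) apart from a genuine zero offset.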

I'll try to get a new build up today or tomorrow (I probably should have done so before declaring QCN "non-alpha"!).

Edit: OK, it was a dumb error I had in the path name for Linux, so I made a new version and put it up. You can wait until you get a new workunit, or "abort" your current one to get it. It seems to sync OK on my VMware test:

http://qcn.stanford.edu/qcnalpha/trigger.php?hostid=6108

Rudy Toody
Joined: 22 Mar 09
Posts: 8
Credit: 7,925
RAC: 0
Message 1444 - Posted: 23 Apr 2009, 3:46:39 UTC - in response to Message 1435.

I have aborted and reset and still get the truncations.

Carl Christensen
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Jan 08
Posts: 1039
Credit: 11,336
RAC: 0
Message 1445 - Posted: 23 Apr 2009, 4:31:42 UTC - in response to Message 1444.

Well, it seems to be doing the time sync but leaving zombies. I think that's because the customized popen pipe code I'm using isn't cleaning up after itself nicely, so I'll try a waitpid() and see if that fixes it.
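
The general pattern (run a helper, read its output through a pipe, then wait for it so no zombie is left) can be sketched as follows. This uses Python's standard `subprocess` module as a stand-in; it is not the customized popen code mentioned above, and `run_and_reap` is a made-up name. `Popen.communicate()` reads the pipe to end-of-file and then waits for the child, which is the step that reaps it.

```python
import subprocess


def run_and_reap(argv, timeout=30.0):
    """Run argv, capture its combined output, and always wait for the child to exit."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold error messages into the same pipe
        text=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # a hung helper is killed...
        output, _ = proc.communicate()  # ...and still waited for, so it is reaped
    return proc.returncode, output
```

With the command from earlier in the thread this would be called as `run_and_reap(["ntpdate", "-p", "8", "-t", "20", "-u", "-b", "-q", "qcn-upl.stanford.edu"])`. Passing the arguments as a list also avoids assembling one long command string, so nothing can be cut off by a fixed-size buffer.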

Mine has been getting sync times on a few Linux PCs (but still leaving zombies behind):

5172469 qcne_sc300_sta200_015414_0 23 Apr 2009 4:21:43 UTC 23 Apr 2009 4:21:46 UTC 23 Apr 2009 4:17:23 UTC -0.000476 3.528071 0.296161 37.887708 -122.277847 0.02 2 JoyWarrior 24F8 USB 4.76 7.38 1379.66
5172318 qcne_sc300_sta200_015414_0 23 Apr 2009 4:12:23 UTC 23 Apr 2009 4:12:27 UTC 23 Apr 2009 4:02:22 UTC -0.006883 12.872495 1.516854 37.887708 -122.277847 0.02 2 JoyWarrior 24F8 USB 4.76 6.11 819.06
5172271 qcne_sc300_sta200_015414_0 23 Apr 2009 4:09:26 UTC 23 Apr 2009 4:09:30 UTC 23 Apr 2009 4:02:22 UTC -0.006883 3.594747 0.328485 37.887708 -122.277847 0.02 0 JoyWarrior 24F8 USB 4.76 3.52 642.8
5172260 qcne_sc300_sta200_015414_0 23 Apr 2009 4:08:17 UTC 23 Apr 2009 4:08:22 UTC 23 Apr 2009 4:02:22 UTC -0.006883 4.422219 0.324538 37.887708 -122.277847 0.02 0 JoyWarrior 24F8 USB 4.76 3.2 573.9

Rudy Toody
Joined: 22 Mar 09
Posts: 8
Credit: 7,925
RAC: 0
Message 1446 - Posted: 23 Apr 2009, 14:48:04 UTC - in response to Message 1445.
Last modified: 23 Apr 2009, 15:00:02 UTC

I detached and re-attached and I still get no time-sync. The truncation still exists. It's like I haven't gotten the new version.

Edit to remove a date.

Carl Christensen
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Jan 08
Posts: 1039
Credit: 11,336
RAC: 0
Message 1456 - Posted: 24 Apr 2009, 1:56:19 UTC - in response to Message 1446.

Odd. I just put up a new version (4.80) in which I doubled the size allowed for the command line (which should be much bigger than required); see if that works?

talister
Joined: 17 May 08
Posts: 6
Credit: 50,699
RAC: 0
Message 1457 - Posted: 24 Apr 2009, 6:23:46 UTC - in response to Message 1456.

Nope, still getting zombies every 15 mins. This is with 4.80. I quit and restarted BOINC, but this didn't alter things.

Carl Christensen
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Jan 08
Posts: 1039
Credit: 11,336
RAC: 0
Message 1461 - Posted: 24 Apr 2009, 17:36:25 UTC - in response to Message 1457.

That's odd; I get no zombies and the time sync is OK with 4.80 on my Ubuntu and Debian tests (run for about an hour). I'm afraid I won't have much time to work on Linux until I move back to Oxford in the next month, where I'll have more Linux boxes to play with. If the zombies start hogging your machine, you may want to suspend until I figure out what's happening (but I'm using waitpid, so that should clear up any zombies, and it seems to be time syncing OK, so I don't see where it would have crashed or anything).

Rudy Toody
Joined: 22 Mar 09
Posts: 8
Credit: 7,925
RAC: 0
Message 1465 - Posted: 24 Apr 2009, 19:33:53 UTC - in response to Message 1461.

I have no zombies at all! It doesn't appear that the ntpdate process is ever invoked (I haven't seen anything pop up on the system monitor.) The error file shows a non-sync every three minutes.



Copyright © 2013 Stanford University