Ticket #340 (closed enhancement: fixed)

Opened 12 years ago

Last modified 12 years ago

Metrics gathering

Reported by: jdreed
Owned by: broder
Priority: normal
Milestone: Fall 2009 Release
Component: --
Keywords:
Cc:
Fixed in version:
Upstream bug:

Description

Per discussion at release team, we need a way to identify what Athena is being used for. To start with, we need to be able to differentiate between "Email", "Web", "Academic Software" (defined as some popular -thirdparty packages), and "Other". Possibly also "Writing Code".

Attachments

connector.c (2.4 KB) - added by broder 12 years ago.

Change History

comment:1 Changed 12 years ago by geofft

From the technical side, I have two tries at an implementation of this, using two Linux APIs, inotify and cn_proc. They're in /mit/geofft/debathena/inotify.c and /mit/geofft/debathena/connector.c respectively.

inotify watches a small number of directories we give it and reports back events on open, access, and close, so watching /bin and /usr/bin would be a first-order way to find what applications are used.
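As a rough illustration of what those inotify reports look like, here is a minimal Python sketch that decodes a buffer of raw inotify events. It is not taken from inotify.c. The function name `parse_inotify_events` is made up for this example, and the buffer is assumed to come from reading an inotify file descriptor (Python's standard library has no inotify wrapper, so the descriptor would have to be obtained some other way, e.g. via ctypes).

```python
import struct

# Event mask bits from <sys/inotify.h>
IN_ACCESS = 0x001
IN_CLOSE_WRITE = 0x008
IN_CLOSE_NOWRITE = 0x010
IN_OPEN = 0x020

# struct inotify_event header: int wd; uint32_t mask, cookie, len;
EVENT_HEADER = struct.Struct("=iIII")


def parse_inotify_events(buf):
    """Decode raw inotify bytes into (watch descriptor, mask, name) tuples.

    Hypothetical helper for illustration only.
    """
    events = []
    offset = 0
    while offset + EVENT_HEADER.size <= len(buf):
        wd, mask, _cookie, name_len = EVENT_HEADER.unpack_from(buf, offset)
        offset += EVENT_HEADER.size
        # The name is NUL-padded out to name_len bytes.
        raw_name = buf[offset:offset + name_len].split(b"\0", 1)[0]
        offset += name_len
        events.append((wd, mask, raw_name.decode(errors="replace")))
    return events
```

With a watch on /usr/bin, an event whose mask has IN_OPEN set and whose name is "firefox" would be a first-order hint that firefox was started.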

cn_proc, via the kernel's less-than-clearly-named "connector" netlink interface, reports on every process creation (fork), exec, and exit, so we can pretty accurately track which processes exist and for how long. I prefer this method because this is the "right" API for this; it gives us better accuracy and better resilience to lost events (which can happen with either method).
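For readers unfamiliar with the connector, the following Python sketch shows the wire format involved: the subscription message a listener sends and how an exec or exit event is decoded. It is an illustration only, not the code in connector.c. The constants mirror the kernel headers (linux/netlink.h, linux/connector.h, linux/cn_proc.h); the function names are made up here, and actually sending the message needs a netlink socket and typically root privileges.

```python
import struct

CN_IDX_PROC = 1              # connector index for process events
CN_VAL_PROC = 1
NLMSG_DONE = 3
PROC_CN_MCAST_LISTEN = 1     # "start sending me events"
PROC_EVENT_FORK = 0x00000001
PROC_EVENT_EXEC = 0x00000002
PROC_EVENT_EXIT = 0x80000000

NLMSGHDR = struct.Struct("=IHHII")   # len, type, flags, seq, pid
CN_MSG = struct.Struct("=IIIIHH")    # idx, val, seq, ack, len, flags
EVENT_HDR = struct.Struct("=IIQ")    # what, cpu, timestamp_ns


def subscribe_message(pid):
    """Build the netlink message that asks the kernel for process events."""
    op = struct.pack("=I", PROC_CN_MCAST_LISTEN)
    cn = CN_MSG.pack(CN_IDX_PROC, CN_VAL_PROC, 0, 0, len(op), 0)
    total = NLMSGHDR.size + len(cn) + len(op)
    return NLMSGHDR.pack(total, NLMSG_DONE, 0, 0, pid) + cn + op


def parse_event(datagram):
    """Return (what, pid, tgid, timestamp_ns) for an exec or exit event.

    Fork events use the same header, but their payload starts with the
    parent's pid and tgid rather than the new process's.
    """
    offset = NLMSGHDR.size + CN_MSG.size
    what, _cpu, timestamp_ns = EVENT_HDR.unpack_from(datagram, offset)
    pid, tgid = struct.unpack_from("=II", datagram, offset + EVENT_HDR.size)
    return what, pid, tgid, timestamp_ns
```

Pairing each exec event with the later exit event for the same pid is what gives the "which processes exist and for how long" view.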

I haven't yet looked at the significantly less technical side of this, which is to modify either of these programs to look for the processes we care about, categorize them, add up how long each one was run, and then submit this to some central server.
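The categorize-and-sum step might look something like the sketch below. The program-to-category table is entirely hypothetical (the ticket only names the categories, not their members), as are the function names and the (path, start, end) input shape.

```python
import os
from collections import defaultdict

# Hypothetical mapping; the real list of programs per category is not
# specified in this ticket.
CATEGORIES = {
    "thunderbird": "Email",
    "pine": "Email",
    "firefox": "Web",
    "matlab": "Academic Software",
    "maple": "Academic Software",
    "emacs": "Writing Code",
    "gcc": "Writing Code",
}


def categorize(path):
    """Map an executable path to a category, defaulting to "Other"."""
    return CATEGORIES.get(os.path.basename(path), "Other")


def total_runtime(runs):
    """Sum seconds of runtime per category.

    runs: iterable of (executable path, start time, end time), in seconds.
    """
    totals = defaultdict(float)
    for path, start, end in runs:
        totals[categorize(path)] += end - start
    return dict(totals)
```

Only the per-category totals would need to be submitted to the central server, not the individual paths.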

comment:2 Changed 12 years ago by broder

I'm coming around to using the connector instead of the other interfaces we've looked at. It allows us to see what programs are being run out of AFS as well as off the local system, which may be useful since alexp has expressed skepticism of the wrapper scripts' stats in the past.

Attached is the version of geofft's connector.c that I've been working with. It should do a good job of batching large numbers of fork/exec pairs happening in a small period of time, while still not allowing events to queue up for seconds' worth of processing time. It's also only catching exec calls, and only printing the executable path. It probably does want to have better error handling. We may also want to consider filtering based on non-system UIDs (i.e. only print out programs with a UID > 1000 or something).
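The UID filter could be as simple as the following sketch. The 1000 cutoff is an assumption (it is where Debian-style systems start numbering regular users), the function names are made up, and looking up the owner of /proc/<pid> is only one way to get the UID; it is racy for short-lived processes, hence the None return.

```python
import os

MIN_USER_UID = 1000  # assumed boundary between system and regular accounts


def process_uid(pid):
    """Return the owner of /proc/<pid>, or None if the process is gone."""
    try:
        return os.stat(f"/proc/{pid}").st_uid
    except FileNotFoundError:
        # The process exited (or /proc is unavailable) before we looked.
        return None


def is_user_process(uid, min_uid=MIN_USER_UID):
    """True when the UID belongs to a regular (non-system) account."""
    return uid is not None and uid >= min_uid
```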

The intent is to run this under a script in a higher-level language, which reads off of the program's stdout and handles the collection, batching, and submission.
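A minimal sketch of such a wrapper, in Python, assuming the C program prints one executable path per line. The function name, the batch size, and the `submit` callback are placeholders; the actual collection and submission mechanism is not described in this ticket.

```python
from collections import Counter


def batch_exec_paths(lines, batch_size, submit):
    """Count executable paths and hand them to submit() in batches.

    lines: any iterable of text lines, e.g. the stdout of the connector
    program. submit: callable taking a {path: count} dict (placeholder).
    """
    counts = Counter()
    seen = 0
    for line in lines:
        path = line.strip()
        if not path:
            continue
        counts[path] += 1
        seen += 1
        if seen >= batch_size:
            submit(dict(counts))
            counts.clear()
            seen = 0
    if counts:
        # Flush whatever is left when the input ends.
        submit(dict(counts))
```

In practice `lines` would be the stdout pipe of the connector process started with subprocess.Popen, and `submit` would send the batch wherever the metrics are collected.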

The performance hit seems pretty minuscule - about 0.4% for Anders' degenerate fork bomb:

kid-icarus:~/src/moira broder$ time seq 5000 | xargs -I% true

real	0m10.474s
user	0m4.044s
sys	0m5.684s
kid-icarus:~/src/moira broder$ time seq 5000 | xargs -I% true

real	0m10.518s
user	0m4.788s
sys	0m4.852s
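For reference, the 0.4% figure follows from the two wall-clock times above, assuming the first run is without the connector and the second is with it (the ticket does not label them). The variable names are just for this worked example.

```python
baseline_s = 10.474    # "real" time of the first run (assumed: no connector)
monitored_s = 10.518   # "real" time of the second run (assumed: connector on)
execs = 5000           # xargs -I% runs one `true` per input line

overhead_pct = (monitored_s - baseline_s) / baseline_s * 100
per_exec_us = (monitored_s - baseline_s) / execs * 1e6

print(f"{overhead_pct:.2f}% overall, {per_exec_us:.1f} us per exec")
```

That works out to roughly 0.42%, or a little under 9 microseconds per exec. The user and sys times differ between the two runs by far more than that, so a single pair of runs is only a rough indication.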

Changed 12 years ago by broder

comment:3 Changed 12 years ago by broder

geofft pointed out that apparently I need to be using recvfrom or else I'll get spurious wakeups, so new version attached.

comment:4 Changed 12 years ago by broder

  • Status changed from new to proposed

I've added debathena-metrics to debathena-cluster and uploaded both to proposed.

I've also been in touch with Jonathon to verify that none of the various public syslog reports will include metrics data.

comment:5 Changed 12 years ago by broder

  • Status changed from proposed to closed
  • Resolution set to fixed

This has been moved into production, along with documentation on the privacy concerns: http://kb.mit.edu/confluence/x/TQlS
