We record a lot of metrics using graphite. Here is a brief description of the system we’ve settled on.
The services are deployed with debian packages (one each for carbon, graphite-web, and whisper). This works well because graphite integrates cleanly with the OS python.
We have a lot of metrics in munin because it’s so easy for developers to write metrics which work on all systems. It’s a shame that we have never found a way to make higher-precision metrics work for us like this: munin only polls every five minutes and stops working if there are ever any problems on the machine.
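To show why munin metrics are so easy to write: a plugin is just an executable that prints a graph description when called with `config`, and `<field>.value <number>` lines on a normal poll. Here is a minimal sketch in Python. The load-average metric and the field name `load` are illustrative assumptions, not one of our actual plugins.

```python
import os
import sys


def load_average():
    """Return the 1-minute load average (Unix only)."""
    return os.getloadavg()[0]


def config_lines():
    """Lines printed when munin calls the plugin with the 'config' argument."""
    return [
        "graph_title Load average",
        "graph_vlabel load",
        "graph_category system",
        "load.label load",
    ]


def fetch_lines(value):
    """Lines printed on a normal poll: one '<field>.value <number>' per field."""
    return ["load.value {:.2f}".format(value)]


def main(argv):
    if len(argv) > 1 and argv[1] == "config":
        lines = config_lines()
    else:
        lines = fetch_lines(load_average())
    print("\n".join(lines))


if __name__ == "__main__":
    main(sys.argv)
```

The plugin only reports a current value; storage and the five-minute schedule are munin's job, which is where the precision limit comes from.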
We tried a number of methods to integrate munin with graphite. I was not able to convert the rrd files, syncing the rrd files was too slow, and the munin-to-graphite pollers were all pretty buggy.
Eventually we created an NFS mount to the munin server which works very effectively with almost no configuration.
Applications write to statsd. This is much better than polling monitoring like munin because you can just throw in whatever metrics you want without having to work out how to store them. The limitation is that it’s difficult to match metrics up, for example tying a request’s start and finish events to one actual transaction. For this purpose polling is more effective.
We use very simple libraries to make this happen. So far only PHP does this. This is particularly useful because the PHP applications I maintain are not very well automated.
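To give an idea of how little such a library needs to do, here is a minimal sketch of a statsd client. It is written in Python for illustration and is not our PHP library. The class name, the method names and the metric names in the usage note are all hypothetical. The wire format, `<name>:<value>|<type>` sent over UDP, is the standard statsd one.

```python
import random
import socket


def format_metric(name, value, metric_type, sample_rate=1.0):
    """Build one statsd datagram: '<name>:<value>|<type>[|@<rate>]'."""
    payload = "{}:{}|{}".format(name, value, metric_type)
    if sample_rate < 1.0:
        payload += "|@{}".format(sample_rate)
    return payload


class StatsdClient:
    """Fire-and-forget UDP client; 8125 is statsd's usual default port."""

    def __init__(self, host="127.0.0.1", port=8125, prefix=""):
        self.addr = (host, port)
        self.prefix = prefix
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, name, value, metric_type, sample_rate=1.0):
        # With sampling, only send a fraction of events; statsd scales them up.
        if sample_rate < 1.0 and random.random() > sample_rate:
            return
        payload = format_metric(self.prefix + name, value, metric_type, sample_rate)
        try:
            self.sock.sendto(payload.encode("ascii"), self.addr)
        except OSError:
            pass  # recording a metric should never break the application

    def increment(self, name, count=1, sample_rate=1.0):
        self._send(name, count, "c", sample_rate)

    def timing(self, name, milliseconds):
        self._send(name, milliseconds, "ms")

    def gauge(self, name, value):
        self._send(name, value, "g")
```

Usage would look something like `StatsdClient(prefix="shop.").increment("checkout.completed")`. Because it is UDP, nothing has to be configured in advance on the statsd side for a new metric name to show up.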
We’re also using mysql-statsd to do higher-precision monitoring of mysql (though I expect writing to graphite directly would be just as doable).
Statsd itself is installed from OS packages, which is easy enough now that debian distributes a nodejs package.
Collectd is a very effective, fast low-level metric collector. There is some trouble here because there is no graphite write plugin by default for some of the OSs we use.
We never got around to writing metric recording plugins for this because statsd is better for application metrics (because of its “push” style) and munin is easier for any-system metrics.
Grafana is a lot like kibana. We’re using the 2.0 release, which has its own elasticsearch database. This package is great because the authors provide a full debian repository to install it, as do elasticsearch, which we are using for the data store.
We have found a few bugs and annoyances in the fairly new 2.x series (for example, there is no way to change the date range of a single graph without creating a new dashboard) but it is by far the best graphite dashboard we’ve tried.
It would be very useful if we could generate graphs based on machine-specific data. Globs are only OK up to a point. The system taking the metric is often capable of generating a better dashboard than grafana. For example, /dev/sda is the same as /dev/mapper/root_23424 so we shouldn’t have two graphs there. Munin will deal with this problem, but grafana can’t because it has no way to check which partitions the machine has. For our kind of small setup, it would work to build the dashboard from a glob and/or exported resources and then apply hard-coded overrides from the grafana database, but grafana doesn’t have this feature (yet?).
We use puppet for all configuration.
Distributing dashboards is a problem. We have written facter facts which are used to generate the templates on the graph and then export the resource which is collected on the graphing server. It’s really really awkward so we only do it for collectd which has very similar metrics across all deployments.
Gdash — works but pretty basic. The API is handy for trying out graphs.
Graph-explorer — an interesting attempt at getting graphs from metrics. The idea is you rename your metrics to say what units they are in etc. It doesn’t solve the problem of system-specific dashboards though, and it is extremely slow.
There is not enough I/O capacity for all the metrics we record (although it’s impressive how far we got on such a simple setup). We need to upgrade graphite and use ceres, which should be able to cluster the metrics to different hardware. Carbon cache can be used for this but it’s awkward because you have to sort out the sharding yourself.
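"Sorting out the sharding yourself" means, at its simplest, deciding from the metric name alone which machine stores it, so that writers and readers always agree. Here is a minimal sketch of that idea. The host names are hypothetical, and this is plain hash-and-modulo, not what graphite's own tooling does.

```python
import hashlib


def shard_for(metric_name, shards):
    """Pick a shard by hashing the metric name (stable across runs and hosts)."""
    digest = hashlib.md5(metric_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(shards)
    return shards[index]


# Hypothetical carbon-cache instances, each on its own disks.
SHARDS = [("graphite-a.example", 2003), ("graphite-b.example", 2003)]

host, port = shard_for("servers.web1.cpu.user", SHARDS)
```

The awkward part is what happens when a shard is added: with a plain modulo most metric names map to a different machine, so existing whisper files would have to be moved to match.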
It would be a lot nicer if we had a polling system which would write high-precision metrics but still be as easy to write as munin, perhaps using the same protocol. The metrics also need to be suitable for generating graphs directly. Graphite here needs to be dumbed down a bit so non-programmers are capable of easily writing metrics and graphs.
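As a rough illustration of what "the same protocol" could mean, the sketch below takes the text a munin-style plugin prints on a poll and turns it into lines for carbon's plaintext protocol (`<path> <value> <timestamp>`, conventionally on port 2003). This is a thought experiment rather than something we run. The function names, the metric prefix and the host name are hypothetical.

```python
import socket
import time


def munin_to_carbon(fetch_output, prefix, timestamp=None):
    """Convert '<field>.value <number>' lines to '<path> <value> <timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    lines = []
    for raw in fetch_output.splitlines():
        parts = raw.split()
        if len(parts) != 2 or not parts[0].endswith(".value"):
            continue  # skip blank lines and anything that is not a value line
        field = parts[0][: -len(".value")]
        try:
            value = float(parts[1])
        except ValueError:
            continue  # non-numeric readings are skipped
        lines.append("{}.{} {} {}".format(prefix, field, value, timestamp))
    return lines


def send_to_carbon(lines, host="graphite.example", port=2003):
    """Send newline-terminated plaintext lines to a carbon listener."""
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

A poller built on this could run the plugins every few seconds instead of every five minutes. It would still leave the graph-description half of the munin protocol (titles, labels, units) with nowhere to go in graphite.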
Somewhat related, we need a way of propagating machine-specific data to the graphing interface. It would be great to use grafana for this, but exporting data using puppet facts is extremely awkward because the thing taking the metric is usually a lot more intelligent at working out what it’s reporting.