# #monitoringlove
# with Sensu

## OSMC 2014, Nuremberg, Germany

---

![Sensu logo](img/sensu-logo.png)

## [http://sensuapp.org](http://sensuapp.org)

Notes:

Monitoring framework/router

---

# Why Sensu?

- Ruby + JSON
- Scalable architecture
- Plugins in any language
- Can use Nagios checks
- Collects both checks and metrics
- Great community

Notes:

EventMachine

---

# Jochen Lillich

## http://freistil.it
## @geewiz

![](img/freistil-logo.png)

Notes:

Started using Nagios but hit load and queueing problems

DevOps Days 2012, Ulf Månsson: #monitoringlove

---

# Sensu Core

Notes:

Open Source, MIT license

Created as a part-time project by Sean Porter at Sonian, now maintained full-time at Heavy Water Operations

---

# Sensu Enterprise

Notes:

- Improved performance
- Metrics conversion
- Third-party integrations
- Commercial support

---

# Installation

- Omnibus packaging
- Configuration in JSON files
- [Sensu cookbook for Chef](https://github.com/sensu/sensu-chef)
- [Puppet module](https://github.com/sensu/sensu-puppet)

Notes:

Automation: Never forget to add new machines

---

![Sensu architecture](img/sensu-diagram.png)

---

![RabbitMQ](img/rabbitmq-logo.png)

---

![RabbitMQ](img/rabbitmq-logo.png)

- Connects all Sensu components
- Asynchronous communication

---

# Sensu Server

- Orchestrates check execution
- Processes check results
- Triggers event handlers

---

# Sensu Client

- Registers automatically with the Server
- Sends keepalive information
- Receives check execution requests
- Schedules checks locally
- Executes checks
- Publishes check results
- Publishes external events

---

# API

- get event data
- get agent data
- trigger check execution
- resolve events
- silence checks

Notes:

REST-like interface

---

# Dashboard

- [Uchiwa](http://sensuapp.org/docs/latest/dashboards_uchiwa)
- [sensu-admin](https://github.com/sensu/sensu-admin)

---

![Sensu check](img/sensu-check.png)

Notes:

- executed by the Sensu client
- Subscription
- Nagios protocol
- event only for non-zero check results
- type "event" triggers events always
- client-specific values

---

# Scheduling

- Standard checks (server)
- Standalone checks (client)
- Manual checks (API)

---

~~~json
{
  "checks": {
    "disk_free": {
      "type": "status",
      "subscribers": [ "all" ],
      "handlers": [ "default" ],
      "command": "/usr/lib/nagios/plugins/check_disk -w :::disk_warn::: -c :::disk_crit::: -A -x /dev/shm -X nfs -i /boot",
      "interval": 60
    }
  }
}
~~~

Notes:

`interval`, `occurrences`, `refresh`, `low_flap_threshold`, `high_flap_threshold`

---

# Checks in Chef

~~~ruby
sensu_check 'mysql_server' do
  command "/usr/lib/nagios/plugins/check_mysql " +
    "-u 'monitoring' " +
    "-p '#{node['mysql']['server_mon_password']}'"
  handlers ['default']
  standalone true
  interval 30
end
~~~

---

# Metrics check

~~~json
{
  "checks": {
    "load_metrics": {
      "type": "metric",
      "command": "load-metrics.rb",
      "subscribers": [ "production" ],
      "interval": 10
    }
  }
}
~~~

---

# Metrics output

~~~
$ ruby load-metrics.rb
srv3.local.load_avg.one 0.89 1365270842
srv3.local.load_avg.five 1.01 1365270842
srv3.local.load_avg.fifteen 1.06 1365270842
$ echo $?
0
~~~

---

# External events

~~~bash
echo '{ "name": "my_check", "output": "some output", "status": 0 }' > /dev/tcp/localhost/3030
~~~

Useful: https://github.com/solarkennedy/sensu-shell-helper

---

![Sensu handler](img/sensu-handler.png)

---

# Handler types

- Pipe
- TCP
- UDP
- Transport
- Sets

Notes:

- Pipe handlers execute a command (or script) and pass it the event data via STDIN.
- TCP handlers write event data to a TCP socket.
- UDP handlers write event data to a UDP socket.
- Transport handlers publish event data to a Sensu transport, such as RabbitMQ (the default).
- Handler sets group handlers: a way to send the same event data to one or more handlers, or simply to create an alias.
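
A TCP handler only needs a process listening on the other end of the socket. The following is a minimal sketch of such a receiver, not part of Sensu itself. It assumes the sender writes one JSON event per connection and then closes it, and that the event carries the client name, check name and check status; `receive_event` and `summarize` are hypothetical helper names.

~~~ruby
require 'json'
require 'socket'

# Hypothetical helper: accept one connection on the given TCPServer,
# read until the sender closes it, and return the parsed event.
# Assumption: the sender writes a single JSON document per connection.
def receive_event(server)
  connection = server.accept
  JSON.parse(connection.read, symbolize_names: true)
ensure
  connection.close if connection
end

# Hypothetical helper: one-line summary of an event, using the same
# keys a pipe handler would find in the event data read from STDIN.
def summarize(event)
  "#{event[:client][:name]}/#{event[:check][:name]}: " \
    "status #{event[:check][:status]}"
end
~~~

A caller could create the listener with `TCPServer.new('127.0.0.1', 4242)` (the port is arbitrary) and loop over `receive_event`, printing `summarize(event)` for each event received.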
---

# Common event handlers

- Email
- PagerDuty
- Graphite
- IRC
- Slack

---

![Sensu handler](img/sensu-handler.png)

---

# Example handler code

~~~ruby
#!/usr/bin/env ruby

require 'rubygems'
require 'json'

# Read event data
event = JSON.parse(STDIN.read, :symbolize_names => true)

# Write the event data to a file
file_name = "/tmp/sensu_#{event[:client][:name]}_" +
  "#{event[:check][:name]}"

File.open(file_name, 'w') do |file|
  file.write(JSON.pretty_generate(event))
end
~~~

---

# Example handler configuration

~~~json
{
  "handlers": {
    "file": {
      "type": "pipe",
      "command": "/etc/sensu/handlers/file.rb"
    }
  }
}
~~~

---

# Sensu CLI

[https://github.com/agent462/sensu-cli](https://github.com/agent462/sensu-cli)

- `sensu-cli resolve srv3 apache_http`
- `sensu-cli client delete srv3`
- `sensu-cli silence srv3 --reason "Shut up already" --expire 3600`

---

# #chatops

[https://github.com/sensu/sensu-hubot](https://github.com/sensu/sensu-hubot)

- `sensu events summarize`
- `sensu events filter severity critical`
- `sensu events filter subscription webservers`

---

# Monitoring your monitoring

- Check RabbitMQ ready queue!

---

# Scaling Sensu

---

# Scaling a single site

- Sensu Server
- Sensu API
- RabbitMQ
- Redis

Notes:

- Sensu Server: run multiple sensu-server instances with the same RabbitMQ and Redis; automatic internal master election.
- Sensu API: stateless HTTP frontend; traditional load-balancing strategies.
- RabbitMQ: see [RabbitMQ clustering documentation](https://www.rabbitmq.com/clustering.html).
- Redis: single master; multiple Redis instances for fault tolerance; see [Redis Sentinel](http://redis.io/topics/sentinel).

---

# Multi-DC operation

---

![Direct connection](img/sensu-multi-dc1.png)

Notes:

All Sensu clients execute checks locally. Their only interaction with Sensu servers is to push events onto RabbitMQ. Therefore, remote clients can connect directly to a remote RabbitMQ broker over the WAN.
- Very simple architecture, no additional infrastructure needed at remote sites
- Centralized alert handling
- Keepalive failures are now indistinguishable from WAN instability
- Lots of remote clients means lots of TCP connections over the WAN
- All clients appear to be in the same datacenter in Uchiwa

---

![Federation](img/sensu-multi-dc2.png)

Notes:

RabbitMQ [Federation plugin](https://www.rabbitmq.com/federation.html) or [Shovel plugin](https://www.rabbitmq.com/shovel.html)

With RabbitMQ, this picks availability and partition tolerance over consistency.

+ Less infrastructure needed at remote datacenters
+ All Sensu server alerts originate from a single source

- WAN instability can result in floods of client keepalive alerts; requires check dependencies
- Increased RabbitMQ configuration complexity
- All clients “appear” to be in the same datacenter in Uchiwa

---

![Aggregation](img/sensu-multi-dc3.png)

Notes:

- WAN instability does *not* lead to flapping Sensu checks
- Sensu operation continues uninterrupted during a WAN outage
- The overall architecture is easier to understand and troubleshoot
- WAN outages mean a whole datacenter can go dark and not set off alerts (cross-datacenter checks are therefore essential)
- WAN instability can lead to a lack of visibility as Uchiwa may not be able to connect to the remote Sensu APIs
- Requires all the Sensu infrastructure in every datacenter

---

# HA

- [High availability monitoring with Sensu](http://failshell.io/sensu/high-availability-sensu/)

---

# References

---

# Community plugins

https://github.com/sensu/sensu-community-plugins

- Over 600 plugins
- 80 contributors

---

# Support

- #sensu on FreeNode IRC
- sensu-users mailing list
- Commercial support from HeavyWater

---

# Thank you!

## @geewiz
## jochen@freistil.it

---

# Credits

* Samuel Beckett Bridge by Miguel Mendez https://flic.kr/p/dyn2FU