Crisp's engineering team shared their experience monitoring their microservices. Vigil, their open source monitoring project, is a set of poll and push probes that collect health data, with support for multiple languages, a status dashboard, and integration with some external alerting tools.
Crisp offers a live chat solution for websites. Crisp's monitoring toolset, called Vigil, consists of probes and a dashboard which displays the status of various microservices collected by the probes. Vigil's probes fall into two categories: poll and push. Poll probes periodically poll a service over TCP or HTTP, and check the response and response time against expected values. Push probes work by integrating with the microservice source code, and send periodic status information to Vigil from inside the service process. Both models are common in monitoring systems, and most tools support both while favoring one. Vigil is written in Rust, and ran as an internal project for a couple of years before being released as open source.
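As an illustration of the poll model described above, the following is a minimal Python sketch, not Vigil's actual Rust implementation. The function names, the two time limits, and the choice to treat an unexpected or missing response as "dead" are all assumptions made for this example.

```python
import time
import urllib.error
import urllib.request
from typing import Optional

# Hypothetical limits, chosen for illustration only.
SICK_AFTER_SECONDS = 1.0   # slower than this counts as 'sick'
DEAD_AFTER_SECONDS = 5.0   # request timeout; no answer counts as 'dead'


def classify_poll(status_code: Optional[int], elapsed: float,
                  expected_status: int = 200) -> str:
    """Map one poll result to 'healthy', 'sick' or 'dead'."""
    if status_code is None or status_code != expected_status:
        return "dead"      # no response, or not the expected one (assumption)
    if elapsed > SICK_AFTER_SECONDS:
        return "sick"      # expected response, but too slow
    return "healthy"


def poll_http(url: str) -> str:
    """Issue a single HTTP GET and classify the outcome."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=DEAD_AFTER_SECONDS) as resp:
            status: Optional[int] = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        status = None      # connection failure or timeout
    return classify_poll(status, time.monotonic() - started)
```

A scheduler would call `poll_http` at a fixed interval for each monitored endpoint and hand the result to the dashboard. Keeping the classification in a separate pure function makes the thresholds easy to test without network access.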
Crisp serves more than a billion requests per month. Their backend has more than 40 different microservices, most of them non-HTTP. Inter-service communication happens over RabbitMQ. Some of the HTTP-based ones, like the REST API, are behind a load balancer. In addition, there are around 20 daemon processes like Postfix and MongoDB.
Each microservice runs on multiple nodes, and a node is identified by a replica identifier. A node's status, shown on the dashboard, is one of healthy, sick or dead. A service node is marked 'sick' either when its reported system load (CPU or RAM) is above a threshold in push mode, or when its response takes too long in poll mode. A 'dead' status for a service indicates that it might be down.
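The push-mode rules above can be sketched in a few lines of Python. This is a hypothetical model rather than Vigil's code: the `Report` fields, the load threshold, and the report timeout are assumed values, and the "dead" rule reflects the idea that a node that stops reporting is presumed down.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only.
LOAD_THRESHOLD = 0.9    # fraction of CPU or RAM in use
REPORT_TIMEOUT = 30.0   # seconds without a report before 'dead'


@dataclass
class Report:
    """Last status pushed by one node of a service."""
    replica_id: str      # identifies the node
    cpu: float           # CPU load, 0.0 to 1.0
    ram: float           # RAM usage, 0.0 to 1.0
    received_at: float   # timestamp of the report, in seconds


def node_status(last: Report, now: float) -> str:
    """Derive 'healthy', 'sick' or 'dead' from the latest push report."""
    if now - last.received_at > REPORT_TIMEOUT:
        return "dead"    # the node stopped reporting
    if last.cpu > LOAD_THRESHOLD or last.ram > LOAD_THRESHOLD:
        return "sick"    # still reporting, but overloaded
    return "healthy"
```

For example, a node whose last report arrived 10 seconds ago with 20% CPU and 30% RAM would be classified as healthy, while the same node would be classified as dead once more than 30 seconds pass without a new report.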
InfoQ reached out to Valerian Saliou, CTO of Crisp, to find out more about how Vigil does internal as well as external monitoring:
When a node in the web of nodes goes down, we'll know as those microservice nodes are monitored in push mode, which means that if one goes down, it won't report and will quickly trigger a 'Down' notification from Vigil to our Slack and to the public status page, pinpointing the node that went down.
For external monitoring of end user endpoints, Vigil "checks the API at https://api.crisp.chat from a poll probe to check public access is OK", says Saliou, adding that "the same API microservice is also reporting via push, which is why you see two references to the API on the Crisp status page, under the 'Web' group and the 'Relay' group."
Vigil's push integration is supported in multiple languages: Rust, Node.js and Go. It also integrates with third-party services like Slack and email, but there is no support yet for other popular alerting systems like Nagios and PagerDuty. At Crisp, Vigil currently runs on a single node. Redundancy is not on the roadmap, since the goal is "to have a simple status page that does the job and give SaaS developers / sysadmins easy access to a status page that costs nothing", says Saliou.