Online baby monitor company Nanit have implemented a Vigilance Control to ensure that their monitoring systems are working properly. Nanit have described how they implemented such a system, and how easy it is to build effective tools by integrating AWS services together. InfoQ sat down with one of the creators, Miedwar Meshbesher, to discuss the implementation in more detail.
With many open-source and paid tools available to do the job, it can be relatively straightforward to make sure that your systems are monitored properly. But how does a team make sure that the monitoring itself is working as described, and get alerted effectively when there’s a problem with the system that is supposed to be keeping an eye on things?
Nanit previously had a system to monitor their monitoring system – which is built around open-source tools Grafana and Prometheus – but this was retired due to some design issues. It didn’t maintain a history of previous status checks, which limited the flexibility of Nanit’s SRE/DevOps team in defining what counts as a failure, and its design also contained a single point of failure.
Instead, Nanit implemented a Vigilance Control (referred to as a Dead Man’s Switch in their blog post) – a concept whereby the system being monitored sends a regular heartbeat to another monitoring system, and if that heartbeat is not received then an alarm is raised.
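To make the concept concrete, here is a minimal Python sketch of a dead man's switch. It is not Nanit's implementation; the class name, the `timeout_seconds` parameter and the choice to alarm before the first heartbeat arrives are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeadMansSwitch:
    """Hypothetical switch that alarms when heartbeats stop arriving."""

    timeout_seconds: float
    last_heartbeat: Optional[float] = None  # timestamp of the latest heartbeat

    def heartbeat(self, now: float) -> None:
        # Called by the monitored system to signal that it is still alive.
        self.last_heartbeat = now

    def is_alarming(self, now: float) -> bool:
        # Assumption: a switch that has never seen a heartbeat is alarming.
        if self.last_heartbeat is None:
            return True
        # Alarm once the silence is longer than the allowed timeout.
        return (now - self.last_heartbeat) > self.timeout_seconds
```

The important property is that the alarm is driven by the absence of a signal, so it still fires when the monitored system is too broken to report anything at all.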
In a detailed article containing all the relevant code snippets, Miedwar Meshbesher – a senior backend engineer also responsible for SRE/DevOps – describes the full design. The switch is implemented with Terraform resources, and uses AWS services and integrations to form a complete system for monitoring the monitoring stack. Broadly, the process for implementing the control is described in the following six steps:
- Grafana sends a heartbeat to API Gateway
- API Gateway triggers a Lambda function
- CloudWatch monitors that function’s invocation rate
- If heartbeats stop coming, CloudWatch triggers an alarm
- The alarm sends a notification to PagerDuty using an SNS integration
- The on-call person is alerted and fixes the monitoring system
The key to this system is deliberately creating a constantly failing metric within Prometheus and connecting it, via a Grafana webhook, to the Lambda function. Because the metric always fails, the webhook keeps firing, and CloudWatch can check that this heartbeat has not gone stale.
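The AWS side of this pipeline can be sketched in Python as follows. This is an illustration under stated assumptions, not the Terraform from Nanit's post: the function names, the alarm name, the five-minute period and the threshold of one invocation are all hypothetical. The dictionary keys are the parameters accepted by CloudWatch's `PutMetricAlarm` API.

```python
from typing import Any, Dict


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Hypothetical Lambda behind API Gateway.

    The request body is ignored: the invocation itself is the heartbeat,
    and Lambda reports it to CloudWatch as the `Invocations` metric.
    """
    return {"statusCode": 200, "body": "ok"}


def heartbeat_alarm_params(
    function_name: str,
    sns_topic_arn: str,
    period_seconds: int = 300,  # assumed heartbeat window
) -> Dict[str, Any]:
    """Build PutMetricAlarm parameters that fire when heartbeats stop."""
    return {
        "AlarmName": f"{function_name}-heartbeat-missing",  # illustrative name
        "Namespace": "AWS/Lambda",
        "MetricName": "Invocations",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        # A function that is never invoked emits no datapoints, so missing
        # data has to count as a breach for the alarm to fire.
        "TreatMissingData": "breaching",
        # The SNS topic is assumed to forward notifications to PagerDuty.
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. Treating missing data as breaching is the design choice that matters most here, since silence is exactly the failure the switch exists to detect.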
Nanit is a smart baby monitor that uses computer vision to help babies and parents sleep better by analyzing the baby’s sleep and breathing motion, and providing insights and tips on how to improve sleep. InfoQ spoke to Miedwar about Nanit, and his role in building the dead man’s switch.
InfoQ: What was the motivation for creating a vigilance control?
Miedwar Meshbesher: We’re using a vigilance control as a way to monitor our monitoring systems. We rely on Prometheus and Grafana for alerting and we wanted to get notified if these go down so we don’t get a false sense of security. We were also looking for a quick win and didn’t want to deal with changing our network topology. The nice thing about this implementation is that it calls a Lambda function that’s publicly available.
InfoQ: What lessons did you learn in implementing the vigilance control?
Miedwar: It’s an example of how easy it is to build something by just integrating a bunch of different AWS services. We’re using the fact that Lambda functions are monitored with CloudWatch and you can integrate that with SNS to trigger an HTTPS request to PagerDuty. It actually took longer to write the blog post than the Terraform scripts.
InfoQ: If you needed to implement the switch from nothing again, would you change anything? Is there any newer tech that you’d change?
Miedwar: The implementation is still pretty fresh, and not a lot has changed. Another option would be to use ALB (Application Load Balancer) instead of API Gateway; it might be a bit less Terraform code.
Miedwar’s comprehensive blog post on implementing the vigilance control is available on Nanit Engineering’s Medium blog.