The Netflix engineering team recently blogged about Telltale, a monitoring and alerting tool that uses a variety of data sources to learn the typical health of an application. Telltale monitors the health of more than 100 production-facing Netflix applications and also serves as an intelligent incident management tool.
Because metrics are central to understanding application health, Telltale shows only the relevant data for each application. It also surfaces important events, such as nearby deployments and regional traffic evacuations, that bear on an application’s overall health. To convey the health of an application “at a glance,” different colors and numbers indicate severity.
The “heart of Telltale” is the application health model, which captures signals from different sources. The view of an application is built from these signals according to their type. Sources for this model include the open-source Mantis, Project Nimble (Netflix’s failover architecture), the Netflix Streaming Supply Chain, and alerts from the alerting system.
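As a rough illustration of how a health model might roll signals from several sources into one status, here is a minimal Python sketch. It is not Telltale's implementation: the `Signal` and `Severity` types, the source labels, and the "worst severity wins" rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List


class Severity(IntEnum):
    # Hypothetical severity scale; higher means worse.
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


@dataclass(frozen=True)
class Signal:
    source: str         # illustrative label, e.g. "alerting-system"
    kind: str           # illustrative type, e.g. "alert", "metric", "event"
    severity: Severity


def application_health(signals: List[Signal]) -> Severity:
    """Assumed roll-up rule: the worst severity among all signals wins."""
    return max((s.severity for s in signals), default=Severity.HEALTHY)


def group_by_kind(signals: List[Signal]) -> Dict[str, List[Signal]]:
    """Group signals by type so a view can render each type differently."""
    groups: Dict[str, List[Signal]] = {}
    for signal in signals:
        groups.setdefault(signal.kind, []).append(signal)
    return groups
```

Calling `application_health` on a list holding one warning-level metric and one critical alert would yield `Severity.CRITICAL` under this rule, and `group_by_kind` would separate the two for display.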
Telltale’s monitoring draws on different kinds of algorithms: statistical, rule-based, and machine learning. Alerts sent by the system do not need constant tuning. The alerts are also context-aware and reach teams via Slack, email, or PagerDuty. Incident updates are posted in Slack message threads, which improves communication about the application’s current state.
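To make the idea of mixing detection approaches concrete, the following Python sketch pairs a rule-based check with a simple statistical one and maps severities to notification channels. Everything here is an assumption for illustration: the function names, the thresholds, the z-score test, and the severity-to-channel table are hypothetical and are not described in the Netflix post.

```python
from statistics import mean, stdev
from typing import Callable, List, Sequence

# A detector takes recent history plus the newest value and returns True if anomalous.
Detector = Callable[[Sequence[float], float], bool]


def rule_based(history: Sequence[float], value: float, limit: float = 0.05) -> bool:
    """Fixed rule: flag when the value (e.g. an error rate) exceeds a hard limit."""
    return value > limit


def statistical(history: Sequence[float], value: float, z_threshold: float = 3.0) -> bool:
    """Z-score test: flag when the value is far from the historical mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return value != history[0]
    return abs(value - mean(history)) / sigma > z_threshold


def is_anomalous(history: Sequence[float], value: float,
                 detectors: Sequence[Detector]) -> bool:
    """Flag the value if any detector in the mix considers it anomalous."""
    return any(detect(history, value) for detect in detectors)


def channels_for(severity: str) -> List[str]:
    """Hypothetical routing table from severity to notification channels."""
    routes = {
        "critical": ["pagerduty", "slack"],
        "warning": ["slack"],
        "info": ["email"],
    }
    return routes.get(severity, ["email"])
```

In this sketch a value can stay under the fixed limit and still be flagged because it departs sharply from recent history, which is one reason a mix of detectors may need less hand-tuning than a single static threshold.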
To provide better context when raising an incident alert, Telltale highlights possible causes. For post-incident review, an Application Incident Summary shows all recent issues and total downtime, creating an archive of incidents.
Taking monitoring and alerting a step further, teams use Telltale for safer deployments. When Spinnaker, Netflix’s continuous delivery platform, rolls out a new build, Telltale continuously monitors the health of the instances running the latest build. If an issue is detected, the deployment stops and rolls back, which limits the blast radius and shortens the duration of deployment problems.
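The stop-and-roll-back behavior described above can be sketched as a small health gate in Python. This is a hedged illustration only: `monitor_rollout`, its parameters, and the returned status strings are hypothetical, and the sketch does not use any Spinnaker or Telltale API.

```python
from typing import Callable, Iterable


def monitor_rollout(health_checks: Iterable[bool],
                    rollback: Callable[[], None]) -> str:
    """Consume successive health results for the new build.

    health_checks: one boolean per check (True means healthy); a stand-in
                   for whatever health signal a real system would poll.
    rollback:      callback invoked once, on the first unhealthy result.
    """
    for healthy in health_checks:
        if not healthy:
            rollback()            # stop the rollout as soon as a problem appears
            return "rolled_back"
    return "succeeded"            # every check passed


if __name__ == "__main__":
    events = []
    status = monitor_rollout([True, True, False, True],
                             rollback=lambda: events.append("rollback"))
    print(status, events)
```

Because the loop returns on the first unhealthy result, later checks are never consumed, which mirrors the goal of limiting how long a bad build stays in service.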
In related news, a study conducted by Digital Enterprise Journal found that 91% of companies reported missed revenue because of performance and availability problems. Telltale helps the Netflix streaming teams quickly diagnose and remediate problems for uninterrupted “member joy.” Monitoring vendors are paying attention to this trend as well: Dynatrace’s AI engine, “Davis,” identifies a broad set of issues in Azure and AWS environments and streams them to Microsoft Azure Monitor or AWS CloudWatch dashboards, respectively.
There is no plan to open source Telltale anytime soon. The engineering team at Netflix plans to collaborate with their platform team to evaluate improvements in Telltale and continue to build new features.