Step1: Scope the problem

Some good questions to ask:

  1. Who are we building the system for?
  2. What metrics do we want to collect?
  3. What is the scale of the infrastructure?

Let’s assume these are the key features of our system:

  1. the infrastructure being monitored is large-scale
  2. a variety of metrics can be monitored: CPU usage, request counts, etc.

Step2: High level design

A metrics monitoring system generally contains five components:

  1. Data Collection
  2. Data Transmission
  3. Data Storage
  4. Alerting
  5. Visualization

Data Model

Metrics data is usually recorded as a time series: a set of values, each with its timestamp.

Every time series consists of the following:

  1. a metric name
  2. a set of tags/labels
  3. an array of values and their timestamps
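To make the data model concrete, here is a minimal sketch in Python. The class name `TimeSeries`, the `key()` helper, and the sample metric name and labels are all made up for illustration; real time-series databases each define their own format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TimeSeries:
    """One time series: a metric name, a set of labels, and timestamped values."""
    name: str                                                       # metric name
    labels: Dict[str, str] = field(default_factory=dict)            # tags/labels
    points: List[Tuple[int, float]] = field(default_factory=list)   # (unix_ts, value)

    def append(self, timestamp: int, value: float) -> None:
        """Record one value together with its timestamp."""
        self.points.append((timestamp, value))

    def key(self) -> str:
        """Identify the series by its name plus its labels in sorted order."""
        tags = ",".join(f"{k}={v}" for k, v in sorted(self.labels.items()))
        return f"{self.name}{{{tags}}}"


# Hypothetical sample data.
series = TimeSeries("cpu.usage", {"host": "web-01", "region": "us-west"})
series.append(1700000000, 0.62)
series.append(1700000010, 0.71)
print(series.key())  # cpu.usage{host=web-01,region=us-west}
```

The same metric name with a different label set (for example another host) would be a separate time series.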

Data Access Pattern

The write load is heavy, while the read load is spiky.

Data Storage System

It’s not recommended to build your own storage system or to use a general-purpose storage system.

There are many storage systems that are optimized for time-series data.

According to DB-Engines, the two most popular time-series databases are InfluxDB and Prometheus.

High Level Architecture

Step3: Design deep dive

Metrics Collection

Pull vs. push model

There are two ways to collect metrics data, and there is no clear answer as to which one is better.

  • pull model:
    • the metrics collector fetches the configuration metadata of service endpoints from service discovery.
    • the metrics collector pulls metrics data via a pre-defined HTTP endpoint.
    • a single collector cannot handle thousands of servers, so we use a pool of collectors and assign each one a range on a consistent hash ring.
  • push model:
    • a collection client is commonly installed on every server being monitored. The client is a piece of long-running software that collects metrics from the services on that server and pushes them to the metrics collector.
    • aggregating on the client is an effective way to reduce the volume of data sent to the metrics collector.
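As a rough illustration of how collectors in the pull model could share the servers through a consistent hash ring, here is a small Python sketch. The class `CollectorRing`, the collector and server names, and the choice of MD5 with 100 virtual nodes per collector are all assumptions made for the example, not something a particular monitoring system prescribes.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class CollectorRing:
    """Assign each monitored server to exactly one collector."""

    def __init__(self, collectors, vnodes=100):
        # Place several virtual nodes per collector to spread the load.
        self._ring = sorted(
            (_hash(f"{c}#{i}"), c) for c in collectors for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    def collector_for(self, server: str) -> str:
        # Walk clockwise to the first virtual node after the server's position,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._hashes, _hash(server)) % len(self._ring)
        return self._ring[idx][1]


ring = CollectorRing(["collector-1", "collector-2", "collector-3"])
servers = [f"server-{n}" for n in range(1000)]
assignment = {s: ring.collector_for(s) for s in servers}

# If collector-3 goes away, only the servers it owned are reassigned.
smaller = CollectorRing(["collector-1", "collector-2"])
moved = [s for s in servers if smaller.collector_for(s) != assignment[s]]
```

Because each server maps to a single collector, no server is pulled twice, and adding or removing a collector only reshuffles a fraction of the servers.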

Metrics Transmission

There is a risk of data loss if the time-series database is unavailable. To mitigate this, we introduce a message queue between the data collector and the time-series DB.
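The sketch below shows the idea in Python. Everything in it is a stand-in: `TimeSeriesDB` is a fake database that can be switched offline, and the in-memory `deque` plays the role of the message queue. A real deployment would use a durable queue, since an in-memory buffer is lost if the process dies.

```python
from collections import deque


class TimeSeriesDB:
    """Hypothetical stand-in for a time-series database that can go offline."""

    def __init__(self):
        self.available = True
        self.rows = []

    def write(self, point):
        if not self.available:
            raise ConnectionError("time-series DB unavailable")
        self.rows.append(point)


class BufferedPipeline:
    """Collector -> queue -> DB, so a DB outage does not drop data points."""

    def __init__(self, db):
        self.db = db
        self.queue = deque()  # stand-in for a durable message queue

    def publish(self, point):
        """Collector side: enqueue the point regardless of the DB's state."""
        self.queue.append(point)

    def drain(self):
        """Consumer side: write queued points in order, stopping at the first failure."""
        written = 0
        while self.queue:
            try:
                self.db.write(self.queue[0])
            except ConnectionError:
                break  # leave the point queued and retry on a later call
            self.queue.popleft()  # remove only after a successful write
            written += 1
        return written


db = TimeSeriesDB()
pipeline = BufferedPipeline(db)

db.available = False                       # simulate an outage
pipeline.publish(("cpu.usage", 1700000000, 0.62))
pipeline.publish(("cpu.usage", 1700000010, 0.71))
during_outage = pipeline.drain()           # nothing can be written yet

db.available = True                        # the DB comes back
after_recovery = pipeline.drain()          # the buffered points are flushed
```

The collector never talks to the database directly, so it keeps accepting data while the database is down.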

Alerting System

  1. rules are defined as config files on disk
  2. the config files are loaded into the alert manager
  3. based on the config rules, the alert manager calls the query service
  4. if a value violates its threshold, an alert event is created
  5. notifications are sent, and each one is guaranteed to be sent at least once
  6. eligible alerts are inserted into a message queue
  7. alert consumers pull alert events from the message queue
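The flow above can be sketched roughly as follows. This is a toy version under several assumptions: the rule format (JSON with `name`, `metric`, `threshold`), the function names, and the sample metrics are invented for the example, a plain dict lookup stands in for the query service, and a `deque` stands in for the message queue.

```python
import json
from collections import deque

# Hypothetical rule config, as it might be stored in a file on disk.
RULES_JSON = """
[
  {"name": "high-cpu", "metric": "cpu.usage", "threshold": 0.9},
  {"name": "high-errors", "metric": "http.5xx_rate", "threshold": 0.05}
]
"""


def evaluate(rules, query, alert_queue):
    """Alert manager: query each metric and enqueue an event if it exceeds its threshold."""
    for rule in rules:
        value = query(rule["metric"])
        if value > rule["threshold"]:
            alert_queue.append({"rule": rule["name"], "value": value})


def consume(alert_queue, notify):
    """Alert consumer: try each queued alert once; failed ones stay queued for the next pass."""
    delivered = 0
    for _ in range(len(alert_queue)):
        alert = alert_queue.popleft()
        if notify(alert):
            delivered += 1
        else:
            alert_queue.append(alert)  # retried later, hence "at least once"
    return delivered


rules = json.loads(RULES_JSON)                         # load the rule config
latest = {"cpu.usage": 0.95, "http.5xx_rate": 0.01}    # stand-in for the query service
alerts = deque()
evaluate(rules, latest.__getitem__, alerts)

sent = []
attempts = {"count": 0}


def flaky_notify(alert):
    """Hypothetical notification channel that fails on its first call."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        return False
    sent.append(alert["rule"])
    return True


first_pass = consume(alerts, flaky_notify)    # delivery fails, alert stays queued
second_pass = consume(alerts, flaky_notify)   # retry succeeds
```

Keeping a failed alert in the queue is what gives at-least-once delivery here; the trade-off is that a receiver may occasionally see the same alert twice.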

Step4: Wrap Up

In this article, we went through how to design a monitoring system.

Some takeaways:

  1. A monitoring system should have five components:
    1. data collection
    2. data transmission
    3. time-series DB
    4. alerting
    5. visualization
  2. data collection has two possible modes, pull and push; neither is preferred outright, and each has its advantages and drawbacks
  3. data transmission carries a risk of data loss, so we introduce a message queue to mitigate it
  4. a time-series DB is optimized for the time-series data that a monitoring system produces
  5. the basic flow of the alerting service: rule config -> query -> notification