Prometheus

Alert notifications from Prometheus Alertmanager

Prerequisites:

This section explains how to integrate and ingest alerts from the Prometheus monitoring tool into the CloudFabrix AIOps platform.

Prometheus Alertmanager is the alert management component of Prometheus. It supports alert notifications via email, Slack, webhook, and other channels. The CloudFabrix AIOps platform uses the webhook notification method over HTTP to receive and ingest alerts or events.

See Alert Sources to create a Webhook URL for Prometheus alert notifications in the CloudFabrix OIA application.

Prometheus Alert Rules Configuration:

Below is a sample alert rules configuration file that defines threshold rules to trigger alerts for monitored assets. (Note: The alert rules below are for reference only.)

groups:
- name: ALERTENGINE
  rules:
  - alert: ALERT_MANAGER_FAILURES
    expr: rate(alertmanager_notifications_failed_total[5m]) > 0
    labels:
      severity: CRITICAL
      category: ALERTING
    annotations:
      title: Alertmanager is failing to send notifications
      description: Alertmanager is seeing errors {{$labels.integration}}

- name: CATASTROPHIC
  rules:
  - alert: HOST_DOWN
    expr: avg_over_time(up{job=~"Hosts|Containers"}[2m]) == 0
    labels:
      severity: CRITICAL
      category: AVAILABILITY
    annotations:
      summary: "{{$labels.instance}}: Host is unreachable. Host could be down. The collectors are not accessible. If the host is up, make sure the collectors are running."
      description: "{{$labels.instance}}: Host is unreachable. Host could be down. The collectors are not accessible. If the host is up, make sure the collectors are running."

- name: HOST
  rules:
  - alert: HOST_HIGH_MEMORY_USAGE
    expr: (((avg_over_time(node_memory_MemTotal_bytes[5m]) - avg_over_time(node_memory_MemFree_bytes[5m]) - avg_over_time(node_memory_Cached_bytes[5m])) / (avg_over_time(node_memory_MemTotal_bytes[5m])) * 100)) > 80
    labels:
      severity: HIGH
      category: HOST_MEMORY
    annotations:
      summary: "{{$labels.instance}}: Memory usage detected above 80%"
      description: "{{$labels.instance}}: Memory usage is above 80% (Current Used Memory % is: {{ $value }})"

  - alert: HOST_HIGH_DISK_USAGE
    expr: ((avg_over_time(node_filesystem_size_bytes{fstype=~"(ext.|xfs)"}[5m]) - avg_over_time(node_filesystem_free_bytes{fstype=~"(ext.|xfs)"}[5m])) * 100 / avg_over_time(node_filesystem_size_bytes{fstype=~"(ext.|xfs)"}[5m])) > 70
    labels:
      severity: HIGH
      category: HOST_DISK
    annotations:
      summary: "{{$labels.instance}}: Disk {{$labels.device}} usage detected above 70%"
      description: "{{$labels.instance}}: Disk {{$labels.device}} usage is above 70% (Current Disk Used % is: {{ $value }})"

  - alert: HOST_HIGH_CPU_USAGE
    expr: (100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 70
    labels:
      severity: HIGH
      category: HOST_CPU
    annotations:
      summary: "{{$labels.instance}}: CPU usage detected above 70%"
      description: "{{$labels.instance}}: CPU usage is above 70% (Current CPU % is: {{ $value }})"

  - alert: HOST_HIGH_DISK_UTILIZATION
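    # rate() of an I/O time counter measured in seconds is a 0-1 fraction, so multiply by 100 to get a percentage
    expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 90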
    labels:
      severity: HIGH
      category: HOST_DISK
    annotations:
      summary: "{{$labels.instance}}: Disk ( {{ $labels.device }} ) utilization is very high."
      description: "{{$labels.instance}}: Disk ( {{ $labels.device }} ) utilization is very high. (Current Utilization is: {{ $value }})"


  - alert: HOST_HIGH_DISK_INODE
    expr: avg_over_time(node_filesystem_files_free{fstype=~"(ext.|xfs)"}[5m]) / avg_over_time(node_filesystem_files{fstype=~"(ext.|xfs)"}[5m]) * 100 <= 20
    labels:
      severity: HIGH
      category: HOST_DISK
    annotations:
      summary: "{{$labels.instance}}: Disk ( {{ $labels.device }} ) inode usage is high (20% or fewer inodes free)"
      description: "{{$labels.instance}}: Disk ( {{ $labels.device }} ) inode usage is high. (Current free inodes % is: {{ $value }})"
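
For the rules above to take effect, Prometheus has to load the rules file and know where Alertmanager is listening. The fragment below is a minimal sketch of the relevant part of prometheus.yml. The file name alert_rules.yml and the address localhost:9093 are assumptions, so substitute the values used in your deployment.

```yaml
# prometheus.yml (fragment)

# Assumption: the alert rules shown above are saved as alert_rules.yml
# in the same directory as prometheus.yml.
rule_files:
  - alert_rules.yml

# Assumption: Alertmanager runs on the same host on its default port 9093.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
```

Before reloading Prometheus, the rules file can be validated with `promtool check rules alert_rules.yml`.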

Prometheus Alertmanager Configuration for Alert Notifications:

Below is a sample Prometheus Alertmanager configuration (config.yml) that sends alert notifications to the CloudFabrix AIOps platform over a webhook URL.

route:
  repeat_interval: 1m
  receiver: cfx-webhook

receivers:
- name: cfx-webhook
  webhook_configs:
  - url: 'https://<cfx-aiops-webhook-URL>'
    send_resolved: true
    http_config:
#      basic_auth:
#        username: <optional>
#        password: <optional>
      tls_config:
        insecure_skip_verify: true
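
The sample above sets a one-minute repeat_interval and skips TLS certificate verification, which is convenient for a first test. As a hedged variant for longer-running setups, the sketch below groups related alerts, uses longer intervals, enables basic authentication, and verifies the server certificate against a CA file. The interval values, the group_by labels, and the CA file path are illustrative assumptions, not CloudFabrix recommendations. The URL and credential placeholders are kept as in the sample above.

```yaml
route:
  receiver: cfx-webhook
  # Assumption: group by alert name and instance; choose labels that suit your setup.
  group_by: ['alertname', 'instance']
  group_wait: 30s       # wait before sending the first notification for a new group
  group_interval: 5m    # wait before notifying about new alerts added to a group
  repeat_interval: 1h   # wait before resending a notification for a still-firing alert

receivers:
- name: cfx-webhook
  webhook_configs:
  - url: 'https://<cfx-aiops-webhook-URL>'
    send_resolved: true
    http_config:
      basic_auth:
        username: <optional>
        password: <optional>
      tls_config:
        # Hypothetical path to the CA certificate that signed the webhook endpoint's certificate.
        ca_file: /etc/alertmanager/cfx-ca.pem
```

The file can be checked with `amtool check-config config.yml` before Alertmanager is restarted.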
