Prometheus
Alert notifications from Prometheus Alertmanager

Prerequisites:

This section explains how to integrate and ingest alerts from the Prometheus monitoring tool into the CloudFabrix AIOps platform.
Prometheus Alertmanager is the alert management component of Prometheus. It supports alert notifications via email, Slack, webhook, and other channels. The CloudFabrix AIOps platform uses the webhook notification method over HTTP to receive and ingest the alerts or events.
See Alert Sources to create a Webhook URL for Prometheus alert notifications in the CloudFabrix OIA application.

Prometheus Alert Rules Configuration:

The sample configuration below defines alert threshold rules that trigger alerts for monitored assets. It belongs in a Prometheus alert rules configuration file. (Note: These alert rules are for reference only.)
groups:
  - name: ALERTENGINE
    rules:
      - alert: ALERT_MANAGER_FAILURES
        expr: rate(alertmanager_notifications_failed_total[5m]) > 0
        labels:
          severity: CRITICAL
          category: ALERTING
        annotations:
          title: Alertmanager is failing to send notifications
          description: "Alertmanager is seeing errors {{$labels.integration}}"

  - name: CATASTROPHIC
    rules:
      - alert: HOST_DOWN
        expr: avg_over_time(up{job=~"Hosts|Containers"}[2m]) == 0
        labels:
          severity: CRITICAL
          category: AVAILABILITY
        annotations:
          summary: "{{$labels.instance}}: Host is unreachable. Host could be down. The collectors are not accessible. If the host is up, make sure collectors are running."
          description: "{{$labels.instance}}: Host is unreachable. Host could be down. The collectors are not accessible. If the host is up, make sure collectors are running."

  - name: HOST
    rules:
      - alert: HOST_HIGH_MEMORY_USAGE
        expr: (((avg_over_time(node_memory_MemTotal_bytes[5m]) - avg_over_time(node_memory_MemFree_bytes[5m]) - avg_over_time(node_memory_Cached_bytes[5m])) / (avg_over_time(node_memory_MemTotal_bytes[5m])) * 100)) > 80
        labels:
          severity: HIGH
          category: HOST_MEMORY
        annotations:
          summary: "{{$labels.instance}}: Memory usage detected above 80%"
          description: "{{$labels.instance}}: Memory usage is above 80% (Current Used Memory % is: {{ $value }})"

      - alert: HOST_HIGH_DISK_USAGE
        expr: ((avg_over_time(node_filesystem_size_bytes{fstype=~"(ext.|xfs)"}[5m]) - avg_over_time(node_filesystem_free_bytes{fstype=~"(ext.|xfs)"}[5m])) * 100 / avg_over_time(node_filesystem_size_bytes{fstype=~"(ext.|xfs)"}[5m])) > 70
        labels:
          severity: HIGH
          category: HOST_DISK
        annotations:
          summary: "{{$labels.instance}}: Disk {{$labels.device}} usage detected above 70%"
          description: "{{$labels.instance}}: Disk {{$labels.device}} usage is above 70% (Current Disk Used % is: {{ $value }})"

      - alert: HOST_HIGH_CPU_USAGE
        expr: (100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 70
        labels:
          severity: HIGH
          category: HOST_CPU
        annotations:
          summary: "{{$labels.instance}}: CPU usage detected above 70%"
          description: "{{$labels.instance}}: CPU usage is above 70% (Current CPU % is: {{ $value }})"

      - alert: HOST_HIGH_DISK_UTILIZATION
        # rate() of I/O seconds is a 0-1 busy fraction, so multiply by 100 to compare against a percentage
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 90
        labels:
          severity: HIGH
          category: HOST_DISK
        annotations:
          summary: "{{$labels.instance}}: Disk ( {{ $labels.device }} ) utilization is very high."
          description: "{{$labels.instance}}: Disk ( {{ $labels.device }} ) utilization is very high. (Current Utilization is: {{ $value }})"

      - alert: HOST_HIGH_DISK_INODE
        # Fires when 20% or fewer of the inodes remain free
        expr: avg_over_time(node_filesystem_files_free{fstype=~"(ext.|xfs)"}[5m]) / avg_over_time(node_filesystem_files{fstype=~"(ext.|xfs)"}[5m]) * 100 <= 20
        labels:
          severity: HIGH
          category: HOST_DISK
        annotations:
          summary: "{{$labels.instance}}: Disk ( {{ $labels.device }} ) High inode usage"
          description: "{{$labels.instance}}: Disk ( {{ $labels.device }} ) High inode usage. (Current free inode % is: {{ $value }})"
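
For Prometheus to evaluate these rules and forward firing alerts, the main Prometheus configuration has to load the rules file and point at an Alertmanager instance. A minimal sketch of that wiring is shown here. The file name alert_rules.yml and the target localhost:9093 are assumptions for illustration, not values from this guide.

```yaml
# prometheus.yml (fragment)
rule_files:
  - "alert_rules.yml"          # hypothetical path to the alert rules file above

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093" # assumed Alertmanager address (9093 is its default port)
```

Before reloading Prometheus, the rules file can be checked with `promtool check rules alert_rules.yml`.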

Prometheus Alertmanager Configuration for Alert Notifications:

The sample configuration below makes Prometheus Alertmanager send alert notifications to the CloudFabrix AIOps platform through the Webhook URL. (config.yml)
route:
  repeat_interval: 1m
  receiver: cfx-webhook

receivers:
  - name: cfx-webhook
    webhook_configs:
      - url: 'https://<cfx-aiops-webhook-URL>'
        send_resolved: true
        http_config:
          # basic_auth:
          #   username: <optional>
          #   password: <optional>
          tls_config:
            insecure_skip_verify: true
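
The sample above leaves basic authentication commented out and skips TLS certificate verification. Where the webhook endpoint is protected by credentials or presents a certificate signed by a private CA, the receiver might instead look like the sketch below. The credential placeholders and the CA file path are hypothetical, and this guide does not state whether the CloudFabrix webhook requires basic authentication.

```yaml
receivers:
  - name: cfx-webhook
    webhook_configs:
      - url: 'https://<cfx-aiops-webhook-URL>'
        send_resolved: true
        http_config:
          basic_auth:
            username: '<username>'            # placeholder, only if the endpoint requires it
            password: '<password>'            # placeholder
          tls_config:
            ca_file: /etc/alertmanager/ca.pem # hypothetical path to the CA certificate
            insecure_skip_verify: false       # verify the server certificate
```

The resulting file can be validated with `amtool check-config config.yml` before restarting Alertmanager.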