This section explains how to integrate Prometheus with the CloudFabrix AIOps platform and ingest its alerts.
Prometheus Alertmanager is the alert management component of Prometheus. It supports alert notifications through email, Slack, webhooks, and other channels. The CloudFabrix AIOps platform uses the webhook notification method over HTTP to receive and ingest alerts or events.
See Alert Sources to create a Webhook URL for Prometheus alert notifications in the CloudFabrix OIA application.
Prometheus Alert Rules Configuration:
Below is a sample alert rules configuration file that defines threshold rules to trigger alerts for monitored assets. (Note: The alert rules below are for reference only.)
groups:
  - name: ALERTENGINE
    rules:
      - alert: ALERT_MANAGER_FAILURES
        expr: rate(alertmanager_notifications_failed_total[5m]) > 0
        labels:
          severity: CRITICAL
          category: ALERTING
        annotations:
          title: Alertmanager is failing to send notifications
          description: "Alertmanager is seeing errors {{$labels.integration}}"
  - name: CATASTROPHIC
    rules:
      - alert: HOST_DOWN
        expr: avg_over_time(up{job=~"Hosts|Containers"}[2m]) == 0
        labels:
          severity: CRITICAL
          category: AVAILABILITY
        annotations:
          summary: "{{$labels.instance}}: Host is unreachable. Host could be down. The Collectors are not accessible. If the host is up, make sure collectors are running."
          description: "{{$labels.instance}}: Host is unreachable. Host could be down. The Collectors are not accessible. If the host is up, make sure collectors are running."
  - name: HOST
    rules:
      - alert: HOST_HIGH_MEMORY_USAGE
        expr: (((avg_over_time(node_memory_MemTotal_bytes[5m]) - avg_over_time(node_memory_MemFree_bytes[5m]) - avg_over_time(node_memory_Cached_bytes[5m])) / (avg_over_time(node_memory_MemTotal_bytes[5m])) * 100)) > 80
        labels:
          severity: HIGH
          category: HOST_MEMORY
        annotations:
          summary: "{{$labels.instance}}: Memory usage detected above 80%"
          description: "{{$labels.instance}}: Memory usage is above 80% (Current Used Memory % is: {{ $value }})"
      - alert: HOST_HIGH_DISK_USAGE
        expr: ((avg_over_time(node_filesystem_size_bytes{fstype=~"(ext.|xfs)"}[5m]) - avg_over_time(node_filesystem_free_bytes{fstype=~"(ext.|xfs)"}[5m])) * 100 / avg_over_time(node_filesystem_size_bytes{fstype=~"(ext.|xfs)"}[5m])) > 70
        labels:
          severity: HIGH
          category: HOST_DISK
        annotations:
          summary: "{{$labels.instance}}: Disk {{$labels.device}} usage detected above 70%"
          description: "{{$labels.instance}}: Disk {{$labels.device}} usage is above 70% (Current Disk Used % is: {{ $value }})"
      - alert: HOST_HIGH_CPU_USAGE
        expr: (100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 70
        labels:
          severity: HIGH
          category: HOST_CPU
        annotations:
          summary: "{{$labels.instance}}: CPU usage detected above 70%"
          description: "{{$labels.instance}}: CPU usage is above 70% (Current CPU % is: {{ $value }})"
      - alert: HOST_HIGH_DISK_UTILIZATION
        # rate() of I/O seconds is a 0-1 busy fraction, so multiply by 100 to compare against a percentage.
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 90
        labels:
          severity: HIGH
          category: HOST_DISK
        annotations:
          summary: "{{$labels.instance}}: Disk ( {{ $labels.device }} ) utilization is very high."
          description: "{{$labels.instance}}: Disk ( {{ $labels.device }} ) utilization is very high. (Current Utilization % is: {{ $value }})"
      - alert: HOST_HIGH_DISK_INODE
        expr: avg_over_time(node_filesystem_files_free{fstype=~"(ext.|xfs)"}[5m]) / avg_over_time(node_filesystem_files{fstype=~"(ext.|xfs)"}[5m]) * 100 <= 20
        labels:
          severity: HIGH
          category: HOST_DISK
        annotations:
          summary: "{{$labels.instance}}: Disk ( {{ $labels.device }} ) High inode usage"
          description: "{{$labels.instance}}: Disk ( {{ $labels.device }} ) High inode usage. (Current free inode % is: {{ $value }})"
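Prometheus evaluates alert rules only from files listed in its main configuration, and it forwards firing alerts only to the Alertmanager instances declared there. The fragment below is a minimal sketch of the relevant prometheus.yml sections. The file name alert_rules.yml and the Alertmanager address localhost:9093 are assumptions for illustration; substitute the values used in your environment.

```yaml
# prometheus.yml (fragment)

# Load the alert rules file. The file name is an assumed example.
rule_files:
  - "alert_rules.yml"

# Tell Prometheus where to send firing alerts.
# The target address is an assumed example (9093 is Alertmanager's default port).
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"
```

Before reloading Prometheus, the rules file can be validated with `promtool check rules alert_rules.yml`.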
Prometheus Alertmanager Configuration for Alert Notifications:
Below is a sample Prometheus Alertmanager configuration (config.yml) that sends alert notifications to the CloudFabrix AIOps platform through the Webhook URL.