Alert Correlation & Suppression
Alert correlation is the process of grouping related alerts to reduce noise and make alerts and events more actionable. Correlated alerts are grouped and translated into CFX incidents, which are then routed to ITSM systems for handling by NOC/IT analysts. Analysts can then log in to the OIA (Operations Intelligence and Analytics) Incident Room module to perform swift triage, diagnosis and root cause analysis of an incident.
- Ingested alerts and events are normalized to the OIA alert model, so that alerts from most tools and implementations can be handled.
- Customers can add custom attributes to the alert model using the enrichment process.
- Ingested alerts are enriched with context about application, stack, department, ownership, support group, etc. using a process called alert enrichment.
- Enriched alerts are then evaluated for any correlation or suppression to be performed. Suppression policies are used to suppress alerts that escape maintenance windows.
- Alerts that remain are then evaluated for correlation, which is determined by correlation policies. These policies are set up in three ways:
- 1. System-defined policies: Address well-known behavior such as alert burst and alert flapping situations.
- 2. ML-driven correlation recommendations: OIA uses unsupervised ML clustering to detect alert patterns and provides a list of suggested correlations in the form of Symptom Clusters.
- 3. Admin-defined correlation policies: Administrators can define new correlation policies or customize existing policies to meet their needs. For instance, correlation policies allow admins to group alerts across a full stack or an application instance. Admins can also group alerts across a common infrastructure (e.g. network, storage) or shared services (e.g. SSO, DNS).
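The order of the stages above (normalize, enrich, suppress, then hand over to correlation) can be sketched roughly as follows. This is only an illustration and not OIA's implementation: the field names, the `cmdb` lookup table and the `is_suppressed` callback are all hypothetical.

```python
def normalize(raw):
    """Map a tool-specific payload onto a common (hypothetical) alert model."""
    return {
        "source": raw.get("tool", "unknown"),
        "host": raw.get("hostname") or raw.get("node"),
        "severity": str(raw.get("sev", "unknown")).lower(),
        "message": raw.get("msg", ""),
    }


def enrich(alert, cmdb):
    """Attach context (application, support group, ...) looked up by host."""
    return {**alert, **cmdb.get(alert["host"], {})}


def process(raw_alerts, cmdb, is_suppressed):
    """Normalize and enrich each alert, drop suppressed ones,
    and return the remaining alerts for correlation."""
    remaining = []
    for raw in raw_alerts:
        alert = enrich(normalize(raw), cmdb)
        if not is_suppressed(alert):
            remaining.append(alert)
    return remaining


# Hypothetical inputs from two different monitoring tools.
cmdb = {"web01": {"application": "CMS", "support_group": "web-ops"}}
raw_alerts = [
    {"tool": "nagios", "hostname": "web01", "sev": "CRITICAL", "msg": "CPU high"},
    {"tool": "zabbix", "node": "db01", "sev": "warning", "msg": "disk 80%"},
]
# Hypothetical suppression rule: db01 is currently covered by a suppression policy.
to_correlate = process(raw_alerts, cmdb, is_suppressed=lambda a: a["host"] == "db01")
```

Only the `web01` alert survives here, now carrying the `application` and `support_group` context that a correlation policy could later filter or group on.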
Correlation policies are enabled when created, but can be disabled. Correlation policies determine how alerts are grouped together. Most correlation policies can be created in an assisted manner, using the recommendations that OIA's correlation engine provides as symptom clusters.
A correlation policy can result in one or more instances of alert correlation, each represented by an Alert Group.
The following controls are available to specify correlation behavior.
By default, the severity of an alert group is determined by the highest severity among the alerts it comprises. However, customers can assign a certain minimum level of severity to the alert groups formed by this correlation policy.
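As a rough sketch of this rule, the group severity can be computed as the highest member severity, raised to a policy-level minimum when one is set. The severity names, their ranking and the `minimum` parameter are assumptions made for illustration, not OIA's actual severity model.

```python
# Hypothetical severity scale, ordered from lowest to highest.
SEVERITY_RANK = {"info": 0, "warning": 1, "minor": 2, "major": 3, "critical": 4}


def group_severity(alert_severities, minimum=None):
    """Return the highest severity among the alerts, but never lower
    than the optional policy-level minimum."""
    highest = max(alert_severities, key=SEVERITY_RANK.__getitem__)
    if minimum is not None and SEVERITY_RANK[minimum] > SEVERITY_RANK[highest]:
        return minimum
    return highest


print(group_severity(["warning", "major", "info"]))        # major
print(group_severity(["warning", "info"], minimum="major"))  # major
print(group_severity(["critical"], minimum="major"))        # critical
```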
Time boxing is the concept of grouping related alerts that fall within a certain time window, such as 15 minutes, 30 minutes or 1 hour. The time window starts when the first matching alert is detected and closes when the window expires. Any matching alert that arrives after the window has expired results in a new alert group instance, which in turn leads to a new incident.
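The windowing behavior described above can be sketched as follows. This is a simplified illustration that assumes the window length is fixed from the first alert and that an alert arriving exactly at expiry opens a new window; the `time_box` function is hypothetical and not part of OIA.

```python
from datetime import datetime, timedelta


def time_box(timestamps, window):
    """Split alert timestamps into groups. Each window opens at the
    first alert that falls outside the previous window."""
    groups = []
    window_end = None
    for ts in sorted(timestamps):
        if window_end is None or ts >= window_end:
            groups.append([])          # new alert group instance
            window_end = ts + window   # window is anchored to its first alert
        groups[-1].append(ts)
    return groups


t0 = datetime(2024, 1, 1, 10, 0)
arrivals = [t0 + timedelta(minutes=m) for m in (0, 5, 14, 20, 40)]
groups = time_box(arrivals, window=timedelta(minutes=15))
print([len(g) for g in groups])  # [3, 1, 1]
```

With a 15-minute window, the alerts at minutes 0, 5 and 14 share one group. The alert at minute 20 opens a second window that runs to minute 35, so the alert at minute 40 opens a third.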
Precedence values determine which policy wins when an alert matches multiple policies. For example, an alert belonging to symptom cluster "prod" and application "CMS" matches both a policy set up to correlate alerts at the application level (app-name == CMS) and a policy set up at the symptom cluster level (cluster-name == prod). Giving the application-level policy higher precedence ensures that such alerts are grouped at the application level.
Precedence is a numeric value. Higher values indicate higher precedence and take priority when more than one policy matches. Precedence values are optional. If none is provided, the system assigns one automatically based on chronological order, i.e. newly created correlation policies get higher precedence.
A typical approach is to set up broad-scope correlation policies with higher precedence and more specific correlation policies with lower precedence.
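A minimal sketch of precedence-based conflict resolution, using the "CMS"/"prod" example above. The `Policy` class, the `select_policy` function and the numeric values 10 and 20 are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    name: str
    precedence: int                 # higher value wins on conflict
    matches: Callable[[dict], bool]


def select_policy(alert, policies):
    """Return the matching policy with the highest precedence, or None."""
    candidates = [p for p in policies if p.matches(alert)]
    return max(candidates, key=lambda p: p.precedence, default=None)


policies = [
    Policy("cluster-prod", 10, lambda a: a.get("cluster-name") == "prod"),
    Policy("app-cms", 20, lambda a: a.get("app-name") == "CMS"),
]

alert = {"app-name": "CMS", "cluster-name": "prod"}
print(select_policy(alert, policies).name)  # app-cms
```

The alert matches both policies, and the application-level policy is selected because its precedence value is higher.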
Property filters narrow down the selection of related alerts. Each filter matches a property field against specified values using a condition such as equals, contains, or in a list of values.
Property filters allow fine-grained control of correlation policies, so that policies can be aligned with organizational processes, administrative domains or functional groups.
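One way to picture property filters is as a list of (field, condition, value) triples evaluated against each alert. The sketch below assumes that all filters must match (logical AND) and uses made-up operator and field names. It does not reflect OIA's actual filter syntax.

```python
# Hypothetical condition names mapped to their comparison logic.
OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "contains": lambda actual, expected: actual is not None and expected in actual,
    "in": lambda actual, expected: actual in expected,
}


def matches_filters(alert, filters):
    """Return True only if the alert satisfies every (field, op, value) filter."""
    return all(
        OPERATORS[op](alert.get(field), value) for field, op, value in filters
    )


filters = [
    ("Environment", "equals", "Prod"),
    ("message", "contains", "timeout"),
    ("Machine-Type", "in", ["Server", "Network"]),
]

alert = {"Environment": "Prod", "message": "db timeout on web01", "Machine-Type": "Server"}
print(matches_filters(alert, filters))  # True
```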
Related alerts can be grouped by the values of a chosen attribute. This works best for attributes that are enumerations, lists of values, or otherwise represent a limited set of identities.
For example, assume the Machine-Type attribute has the following values: Machine-Type = Application, Server, Network, Storage
If Group By selects Machine-Type as the attribute, the correlation engine automatically groups alerts as follows:
"Machine-Type == Application" into one group.
"Machine-Type == Server" into one group.
"Machine-Type == Storage" into one group.
"Machine-Type == Network" into one group.
Group By can also use multiple attributes to handle more advanced scenarios.
Continuing the example above, add one more attribute and use Group By with two attributes:
Machine-Type = Application, Server, Network, Storage
Environment = Prod, UAT
With these two Group By attributes selected, the following alert groups will be formed:
"Machine-Type == Application and Environment == Prod" into one group.
"Machine-Type == Application and Environment == UAT" into one group.
"Machine-Type == Server and Environment == Prod" into one group.
"Machine-Type == Server and Environment == UAT" into one group.
"Machine-Type == Storage and Environment == Prod" into one group.
"Machine-Type == Storage and Environment == UAT" into one group.
"Machine-Type == Network and Environment == Prod" into one group.
"Machine-Type == Network and Environment == UAT" into one group.
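The Group By behavior for one or more attributes can be sketched by using the tuple of attribute values as the group key. The `group_by` function and the sample alerts are hypothetical. Note that in this sketch a group exists only for value combinations that actually occur in the alerts, so the eight combinations listed above are the maximum rather than a guaranteed count.

```python
from collections import defaultdict


def group_by(alerts, attributes):
    """Group alerts by the tuple of values they hold for the given attributes."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(attr) for attr in attributes)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"id": 1, "Machine-Type": "Server", "Environment": "Prod"},
    {"id": 2, "Machine-Type": "Server", "Environment": "UAT"},
    {"id": 3, "Machine-Type": "Server", "Environment": "Prod"},
    {"id": 4, "Machine-Type": "Network", "Environment": "Prod"},
]

by_type = group_by(alerts, ["Machine-Type"])
by_type_and_env = group_by(alerts, ["Machine-Type", "Environment"])
print(len(by_type), len(by_type_and_env))  # 2 3
```

Grouping by Machine-Type alone yields two groups (Server, Network). Adding Environment splits the Server alerts into separate Prod and UAT groups.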