How it Works
Basic Concepts, Technical Overview and How it Works
cfxOIA works by ingesting IT operational data, such as alerts, events, and traces from multiple performance monitoring tools, logs and log-based alerts from log monitoring tools, and observational data from data lakes, and then algorithmically correlating alerts to reduce noise. OIA normalizes every alert ingested into the platform and enriches it with context derived by stitching together CMDB data, service mappings, and asset management data.
cfxOIA then correlates alerts based on the enriched data. OIA's machine learning engine identifies symptomatic patterns in alert data and presents them as recommendations to AIOps administrators, who can use them to group or deduplicate future alerts that match those symptoms. Admins can also create additional correlation policies to tune algorithmic correlation behavior, for example to group alerts across the entire application stack, within a time window, or within an infrastructure layer.
Alert Correlation process flow
cfxOIA has an out-of-the-box implementation to correlate well-known operational issues related to alert burst scenarios, alert flapping situations, and transient alerts. This correlation engine allows the admin to implement event correlation for any type of situation: the majority of patterns are detected by unsupervised machine learning, and admin-configurable policies provide additional flexibility to tune correlation behavior. Alerts that are correlated are called Alert Groups and the policies are called Correlation Policies.
Deduplicated and correlated alerts are grouped in an Alert Group that indicates an active operational issue or an OIA Incident. Every Alert Group has one OIA Incident, which is sent to the ITSM systems (like ServiceNow, PagerDuty, etc.) and to the OIA Incident Room for further incident processing.
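To make the idea of grouping alerts within a time window concrete, here is a minimal Python sketch. It is not OIA's correlation engine: the `Alert` fields, the `group_by_window` function, and the 300-second default window are all hypothetical names and values chosen for illustration. It simply groups alerts of the same application that arrive within a fixed window of the first alert in the group.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Alert:
    # Hypothetical minimal alert shape, not the OIA alert model.
    alert_id: str
    app: str
    timestamp: float  # seconds since some reference point


def group_by_window(alerts: List[Alert], window_seconds: float = 300.0) -> List[List[Alert]]:
    """Group alerts per application when they fall within
    window_seconds of the first alert of the currently open group."""
    groups: List[List[Alert]] = []
    # app -> (index of the open group, timestamp of its first alert)
    open_group: Dict[str, Tuple[int, float]] = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_group.get(alert.app)
        if current is not None and alert.timestamp - current[1] <= window_seconds:
            groups[current[0]].append(alert)
        else:
            # Window expired or first alert for this app: start a new group.
            groups.append([alert])
            open_group[alert.app] = (len(groups) - 1, alert.timestamp)
    return groups


sample = [
    Alert("a1", "web", 0.0),
    Alert("a2", "web", 100.0),
    Alert("a3", "db", 120.0),
    Alert("a4", "web", 1000.0),
]
grouped_ids = [[a.alert_id for a in g] for g in group_by_window(sample)]
```

In this sketch the window is anchored at the first alert of a group; a sliding window anchored at the most recent alert would be an equally reasonable choice for burst scenarios.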
Accelerates Incident Resolution with all context, triage data, and tools in one place
Incident Room is a dynamic, incident-centric workbench that provides triage data, operational metrics, KPIs, logs, impacted asset context, collaboration, and diagnostic tools in one place, so that operators can swiftly perform incident root cause analysis and service restoration. This helps reduce incident MTTR.
For on-premise deployments, OIA is offered as a packaged application (OVF image) that is available for deployment on VMware vCenter 6.x or above. The packaged application comprises CentOS as the base operating system along with the required OS packages and third-party software modules. In large enterprise environments (restricted environments), where the customer prefers to install OIA on a custom Linux version (for example, RHEL 7.x), it is possible to bring up OIA on existing virtual machines.
OIA operates on IT operational data like alerts, events, traces, and metrics, most of which are generated by monitoring tools and in some cases replicated in an aggregate data lake. OIA supports integrations with many featured vendors using Webhooks, APIs, Kafka messages, etc. Custom integrations can be developed and supported by CloudFabrix professional services or partners, using CloudFabrix-provided developer SDKs.
Large enterprise environments have a mix of structured and unstructured IT data sources, with many custom IT data parameters defined and implemented across them. For example, IT environments can implement custom attributes like machine type, environment, site code, department name, support group, application ID, etc. Not every tool implements these attributes, making it difficult to understand which operational data sources are relevant for an AIOps implementation and which attributes can be gleaned from which sources to enrich raw alert data. This is where the OIA Data Analysis and Stitching module comes into the picture, to help establish:
- Asset Identities
- Enrichment Attributes
- Enrichment Flows
- Baseline Analysis
This module works from historical alert/event data, ticket data, CMDB data, service mappings, and asset management data, and establishes a data chain that helps in selecting appropriate data sources and enrichment attributes for an AIOps implementation.
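One small part of this analysis, working out which attributes can be gleaned from which sources, can be sketched in Python as follows. This is an illustrative assumption about how such a survey could be done, not the module's actual logic; the `attribute_coverage` function and the sample source names and records are hypothetical.

```python
from typing import Any, Dict, List, Set


def attribute_coverage(sources: Dict[str, List[Dict[str, Any]]]) -> Dict[str, Set[str]]:
    """Map each attribute name to the set of sources that actually
    populate it (empty strings and None are treated as missing)."""
    coverage: Dict[str, Set[str]] = {}
    for source_name, records in sources.items():
        for record in records:
            for attribute, value in record.items():
                if value not in (None, ""):
                    coverage.setdefault(attribute, set()).add(source_name)
    return coverage


# Hypothetical sample data from two sources.
sample_sources = {
    "cmdb": [{"hostname": "web01", "environment": "prod", "support_group": ""}],
    "monitoring": [{"hostname": "web01", "severity": "critical"}],
}
coverage = attribute_coverage(sample_sources)
```

Here `hostname` appears in both sources, so it is a candidate asset identity for joining them, while `environment` can only come from the CMDB records.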
Raw alert data contains extremely limited information, often consisting only of an ID, severity, message/description, rule name, and asset IP/hostname. This information doesn't provide enough service context (application or service name, environment, machine type, etc.) or supportability context (NOC ID, site ID, department, support group, etc.), both of which are essential for efficient correlation of alerts. OIA performs automated alert data enrichment using a combination of the following approaches:
- Enrichment with stacks and asset context established through Data Analysis & Stitching module
- Enrichment with stacks and asset context that is dynamically discovered/resolved for elastic environments
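The first approach, enrichment from previously established asset context, can be pictured with the following minimal Python sketch. The lookup table, the attribute names, and the `enrich` function are hypothetical and only illustrate the idea of joining a raw alert to asset context by its IP or hostname; they do not reflect OIA's internal data model.

```python
from typing import Any, Dict

# Hypothetical asset context, keyed by asset IP/hostname,
# as it might be produced by stitching CMDB and service mapping data.
ASSET_CONTEXT: Dict[str, Dict[str, str]] = {
    "10.0.0.5": {
        "application": "billing",
        "environment": "prod",
        "support_group": "noc-east",
    },
}


def enrich(alert: Dict[str, Any], context: Dict[str, Dict[str, str]]) -> Dict[str, Any]:
    """Return a copy of the alert with service and supportability
    attributes added. Attributes already present on the alert win."""
    enriched = dict(alert)
    for key, value in context.get(alert.get("asset"), {}).items():
        enriched.setdefault(key, value)
    return enriched


raw_alert = {"id": "a1", "severity": "critical", "asset": "10.0.0.5"}
enriched_alert = enrich(raw_alert, ASSET_CONTEXT)
unknown_alert = enrich({"id": "a2", "asset": "10.9.9.9"}, ASSET_CONTEXT)
```

An alert whose asset is not in the lookup is returned unchanged, which is where dynamic discovery for elastic environments would have to fill the gap.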
Alerts and events, in general, have a varying schema, but in OIA they are all normalized and standardized to an extensive model (with more than 30 attributes). Related alerts are deduplicated and correlated to form Alert Groups. OIA's correlation engine provides recommendations for detecting and grouping new alert patterns. Admins can review and analyze the recommendations and convert them into Correlation Policies, or define new policies altogether. Admins can also implement alert Suppression Policies to suppress alerts that are generated during maintenance windows. OIA provides out-of-the-box policies to treat well-known operational issues like alert burst scenarios, flapping scenarios, etc.
OIA creates an Incident for every Alert Group and sends it to ITSM tools (ServiceNow, PagerDuty, etc.) for further processing by IT analysts, NOC engineers, or Tier-1/Tier-2 engineers. OIA provides a module called Incident Room that AIOps operators and ITSM operators can use to accelerate incident analysis and resolution. The Incident Room provides all the relevant context, data, insights, and tools in one place for incident resolution.
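As a rough illustration of normalization followed by deduplication, the Python sketch below maps differently named source fields onto a small common model and then collapses identical alerts. The three-attribute model, the `FIELD_ALIASES` table, and both function names are hypothetical simplifications; the OIA model has more than 30 attributes and its actual field mappings are not shown in this document.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical mapping from a common attribute to the names
# different monitoring tools might use for it.
FIELD_ALIASES: Dict[str, Tuple[str, ...]] = {
    "severity": ("severity", "priority", "level"),
    "message": ("message", "description", "summary"),
    "asset": ("asset", "host", "hostname", "ip"),
}


def normalize(raw: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Map a raw alert with an arbitrary schema onto the common model."""
    normalized: Dict[str, Optional[str]] = {}
    for target, aliases in FIELD_ALIASES.items():
        for alias in aliases:
            if alias in raw:
                normalized[target] = str(raw[alias]).strip()
                break
        else:
            normalized[target] = None  # attribute missing in this source
    if normalized["severity"] is not None:
        normalized["severity"] = normalized["severity"].lower()
    return normalized


def deduplicate(alerts: List[Dict[str, Optional[str]]]) -> List[Dict[str, Any]]:
    """Collapse alerts with the same asset and message, keeping a count."""
    seen: Dict[Tuple[Optional[str], Optional[str]], Dict[str, Any]] = {}
    for alert in alerts:
        key = (alert["asset"], alert["message"])
        if key in seen:
            seen[key]["count"] += 1
        else:
            seen[key] = {**alert, "count": 1}
    return list(seen.values())


tool_a = {"priority": "CRITICAL", "description": " disk full ", "host": "web01"}
tool_b = {"level": "critical", "summary": "disk full", "hostname": "web01"}
unique_alerts = deduplicate([normalize(tool_a), normalize(tool_b)])
```

Because both tools' alerts normalize to the same asset and message, they collapse into a single record with a count of 2.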
OIA uses machine learning (ML) at its core to intelligently learn patterns from huge volumes of historical as well as streaming data and automate key IT operational activities and decisions at large scale.
Key ML-driven operations include:
- Alert Correlation (uses unsupervised ML)
- Log Clustering and Heatmap
- Alert volume Seasonality (can be run per app, source system, severity etc.)
- Alert volume Anomaly Detection (can be run per app, source system, severity etc.)
- Alert volume Prediction (can be run per app, source system, severity etc.)
- Incident triage data Anomaly Detection and Noticeable Changes
- Similar incidents
Prediction insight consists of forecasting alert volume or ingestion rate, giving the Ops team a perspective on how many alerts to expect in the future. OIA can perform this prediction analysis on multiple dimensions, including alerts coming from a certain source, or alerts of a certain application, severity, site, or even symptom. In addition to prediction insights, OIA also provides seasonality and anomaly detection when ML jobs are run. ML jobs can be executed on demand or scheduled to run periodically, which helps in continuous learning, training, and testing of models.
OIA currently supports three ML pipelines out of the box: Clustering, Classification, and Regression. ML jobs allow hyperparameter tuning through selections made in the UI itself. Advanced customization scenarios allow uploading of new ML pipelines.
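To give a feel for what alert volume prediction and anomaly detection mean in practice, here is a deliberately simple Python sketch. It uses a seasonal average for the forecast and a z-score test for anomalies; these are stand-in techniques chosen for brevity and are not a description of the models OIA uses. The function names, the default threshold of 3.0, and the sample counts are all assumptions.

```python
from statistics import mean, pstdev
from typing import List


def seasonal_forecast(counts: List[float], period: int) -> float:
    """Forecast the next value as the mean of past values that fall
    at the same position in the seasonal cycle (e.g. same hour of day)."""
    if period <= 0 or len(counts) < period:
        raise ValueError("need at least one full period of history")
    position = len(counts) % period
    return mean(counts[position::period])


def is_anomaly(history: List[float], observed: float, threshold: float = 3.0) -> bool:
    """Flag the observed volume if it deviates from the historical mean
    by more than `threshold` population standard deviations."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > threshold


# Hypothetical alert counts alternating between a quiet and a busy slot.
volume = [10, 50, 12, 48, 11, 52]
next_quiet_slot = seasonal_forecast(volume, period=2)

steady = [10, 12, 11, 9, 13]
spike_flagged = is_anomaly(steady, 40)
normal_flagged = is_anomaly(steady, 12)
```

The same functions could be applied per source, application, severity, or site by first filtering the alert counts on that dimension.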
OIA provides key analytics to track AIOps-related KPIs like noise reduction efficiency, alert ingestion trends, most chatty alert types, etc.
OIA has a unique data exploration feature called Quick Insights that provides an at-a-glance visual cue of the distribution and other characteristics of data. Quick Insights on Incidents provide visual cues about the distribution of Incidents based on priority, support group, incident age, environment, application, department, etc. Similarly, Quick Insights on alerts provide visual cues about the distribution of alerts across severity, source, application, machine type, environment, etc.
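The document does not define how these KPIs are computed. As one plausible reading, noise reduction can be taken as the fraction of ingested alerts that did not become their own Alert Group, and chattiness as a simple frequency ranking. The Python sketch below follows those assumptions; the formula, function names, and sample values are illustrative only.

```python
from collections import Counter
from typing import List, Tuple


def noise_reduction(total_alerts: int, alert_groups: int) -> float:
    """Assumed definition: 1 - (alert groups / ingested alerts)."""
    if total_alerts == 0:
        return 0.0
    return 1.0 - alert_groups / total_alerts


def chattiest_types(alert_types: List[str], top_n: int = 3) -> List[Tuple[str, int]]:
    """Rank alert types by how often they occur."""
    return Counter(alert_types).most_common(top_n)


efficiency = noise_reduction(total_alerts=1000, alert_groups=50)
ranking = chattiest_types(
    ["cpu_high", "disk_full", "cpu_high", "link_down", "cpu_high", "disk_full"],
    top_n=2,
)
```

Under this definition, 1000 alerts reduced to 50 Alert Groups corresponds to a noise reduction of about 95%.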
OIA provides a web-based portal that is accessible via a standard browser and uses HTML5 to render the user interface (UI). There is no need to install any thick client to access the OIA web portal. The OIA portal provides certain advanced UI features for efficient data handling and customization:
- Filters: Allow efficient filtering, and saving and reusing of filters.
- Table View Customization: Customize the columns displayed in the table and change the order of columns.
- Exporting Data: Export data like Incidents, Alerts, etc.