ML Driven Operations
Using ML to perform Correlation, Prediction, Seasonality and Anomaly Detection
OIA uses machine learning (ML) at its core to learn patterns from large volumes of historical and streaming data and to automate key IT operational activities and decisions at scale. AIOps admins can execute ML jobs on demand using any ML pipeline available in the system. OIA has three out-of-the-box (OOTB) ML pipelines, and all hyperparameter tuning can be performed from the UI itself. For advanced customization scenarios, new ML pipelines can be uploaded.
OIA leverages ML to drive the majority of alert correlations. OIA detects various alert problems or symptoms in alert or event data and automatically starts correlating future incoming alerts based on these patterns. OIA uses unsupervised machine learning algorithms that first remove stop words, identities, and other variable tokens (de-variablization) and then form clusters based on the alert message. OIA also allows the AIOps admin to narrow the scope of alert symptom correlations to a particular application, stack, or infrastructure layer.
AI-driven Correlation Engine performing the majority of alert correlations
Sample of alert clusters found by OIA in the training phase
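The de-variablization step described above can be sketched as follows. This is a minimal illustration, not OIA's implementation: the regular expressions, the stop-word list, and the function names devariablize and cluster_alerts are all assumptions made for the example, and grouping by an identical normalized template stands in for the unsupervised clustering algorithm, which the text does not specify.

```python
import re
from collections import defaultdict

# Hypothetical patterns for variable tokens; OIA's actual rules are not documented here.
VARIABLE_PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),  # IPv4-looking values
    (re.compile(r"\b\d+\b"), "<num>"),                      # standalone numbers
]
# Hypothetical stop-word list, kept tiny for the example.
STOP_WORDS = {"the", "a", "an", "on", "is", "for", "of", "at", "to", "in"}


def devariablize(message):
    """Replace variable tokens with placeholders and drop stop words."""
    text = message.lower()
    for pattern, token in VARIABLE_PATTERNS:
        text = pattern.sub(token, text)
    words = re.findall(r"<\w+>|[a-z_]+", text)
    return " ".join(w for w in words if w not in STOP_WORDS)


def cluster_alerts(messages):
    """Group alert messages whose normalized templates are identical."""
    clusters = defaultdict(list)
    for message in messages:
        clusters[devariablize(message)].append(message)
    return dict(clusters)


alerts = [
    "IP address conflict detected on 10.0.0.5 for host srv-01",
    "IP address conflict detected on 10.0.0.17 for host srv-22",
    "Disk usage at 91 percent on srv-03",
]
clusters = cluster_alerts(alerts)
```

In this sketch the two IP address conflict alerts fall into the same cluster because they differ only in their variable parts. A real pipeline would more likely compare templates by similarity than by exact equality.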
Efficiently relating and analyzing large volumes of logs is a common challenge for IT operations teams. When an incident occurs, logs become one of the key sources of triage data. To make log data easier to analyze, OIA uses clustering algorithms to automatically detect and group related logs into clusters, and then plots them on a heatmap that shows the density and occurrence windows of each log cluster. This helps incident operators conduct root cause analysis faster.
OIA performing log clustering on Incident triage data and plotting it on a heatmap
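The data behind such a heatmap can be thought of as a count of log lines per cluster per time window. The sketch below assumes each log line has already been assigned a cluster ID; the function name heatmap_counts, the input shape, and the 15-minute window are illustrative assumptions, not OIA's API or defaults.

```python
from collections import Counter


def heatmap_counts(events, window_seconds=900):
    """Count log events per cluster per time window.

    events: iterable of (epoch_seconds, cluster_id) pairs.
    Returns {cluster_id: {window_start: count}}.
    """
    counts = Counter()
    for ts, cluster_id in events:
        # Align each timestamp to the start of its window.
        window_start = int(ts) - int(ts) % window_seconds
        counts[(cluster_id, window_start)] += 1

    grid = {}
    for (cluster_id, window_start), n in counts.items():
        grid.setdefault(cluster_id, {})[window_start] = n
    return grid


events = [(0, "a"), (10, "a"), (899, "a"), (900, "a"), (1000, "b")]
grid = heatmap_counts(events)
```

Each inner dictionary is one heatmap row: the keys are the occurrence windows and the values give the density used to color the cells.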
Prediction insight consists of forecasting alert volume or ingestion rate, giving the Ops team a view of how many alerts to expect in the future. OIA can perform this prediction analysis on multiple dimensions, including alerts coming from a certain source, alerts of a certain application, severity, or site, or even alerts of a certain symptom. In addition to prediction insights, OIA provides seasonality and anomaly detection when ML jobs are run. These jobs can be executed on demand or scheduled to run periodically, which supports continuous learning, training, and testing of models.
Prediction chart for All Critical Alerts
Prediction chart for IP Address Conflict symptoms
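To make the idea of seasonality-aware forecasting concrete, the sketch below predicts future alert counts by averaging past counts that fall at the same position in the season (for example, the same hour of the day). This seasonal-average baseline is an assumption chosen for illustration; the text does not say which forecasting model OIA uses, and the name seasonal_forecast is hypothetical.

```python
from statistics import mean


def seasonal_forecast(counts, period, horizon):
    """Forecast alert counts with a seasonal-average baseline.

    counts: alert counts per interval, oldest first.
    period: season length in intervals (e.g. 24 for hourly data with a daily cycle).
    horizon: number of future intervals to forecast.
    """
    if len(counts) < period:
        raise ValueError("need at least one full season of history")
    n = len(counts)
    forecast = []
    for step in range(horizon):
        phase = (n + step) % period
        # Average every historical value observed at this phase of the season.
        forecast.append(mean(counts[phase::period]))
    return forecast


history = [1, 2, 3, 3, 4, 5]  # two seasons of length 3
prediction = seasonal_forecast(history, period=3, horizon=4)
```

Running the same function on counts filtered by source, application, severity, site, or symptom would give one forecast per dimension.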
OIA dynamically retrieves incident-specific performance metrics from multiple monitoring tools as part of triage data gathering. This data typically covers the 20 hours before the incident occurred, and OIA keeps retrieving it for 4 hours afterward (both values are configurable). In addition, OIA detects local anomalies and noticeable changes by bucketizing the data into smaller intervals, such as 15 minutes, and then applying regression analysis and statistical analysis to each interval.
Incident Triage data showing Noticeable Changes
Incident Triage data showing Anomalies
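The bucketize-then-analyze approach described above can be sketched as follows. This is one plausible reading, under stated assumptions: samples are averaged into 15-minute buckets, a least-squares line is fitted to the bucket means, and a bucket is flagged when its residual exceeds a multiple of the residual standard deviation. The function names and the threshold of 2.0 are hypothetical; OIA's actual regression and statistical tests are not specified in the text.

```python
from statistics import mean, pstdev


def bucketize(samples, bucket_seconds=900):
    """Average (epoch_seconds, value) samples into fixed-width buckets."""
    buckets = {}
    for ts, value in samples:
        start = int(ts) - int(ts) % bucket_seconds
        buckets.setdefault(start, []).append(value)
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]


def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, y_bar
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar


def find_anomalies(samples, bucket_seconds=900, threshold=2.0):
    """Return buckets whose value deviates strongly from the fitted trend."""
    series = bucketize(samples, bucket_seconds)
    xs = [start for start, _ in series]
    ys = [value for _, value in series]
    slope, intercept = linear_fit(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []  # perfectly linear data has no outliers
    return [
        (x, y)
        for x, y, r in zip(xs, ys, residuals)
        if abs(r) > threshold * spread
    ]


# Ten 15-minute buckets of a flat metric with one spike in the sixth bucket.
samples = [(i * 900, 100 if i == 5 else 10) for i in range(10)]
anomalies = find_anomalies(samples)
```

Fitting a trend first keeps a steadily rising metric from being flagged as anomalous; only buckets that depart from the trend stand out.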
OIA currently supports three ML pipelines out of the box: Clustering, Classification, and Regression. ML jobs allow hyperparameter tuning through selections made in the UI itself. For advanced customization scenarios, new ML pipelines can be uploaded.