
Potential Diagnosis Discovery

The CCX team at Red Hat defines a symptom as a piece of health data reported by an OpenShift deployment that could indicate something is wrong with that deployment. Examples of symptoms include alerts, failing operators, pre-defined rules being triggered, etc. A particular combination of symptoms that is found to occur in several deployments, and that could be indicative of a specific issue, is called a potential diagnosis.

In this notebook, we will investigate the use of machine learning for finding such combinations of symptoms that could be used as potential diagnoses. Broadly speaking, we will first try using clustering to find groups of deployments that show similar symptoms. Then, we will explore frequent pattern mining to find the combinations seen most often in each cluster. Lastly, we will try to determine the "defining" symptoms and symptom combinations for deployments in each cluster. These are the results we want to present to subject matter experts as candidates for diagnosis definitions for the problems seen in deployments.

Internally, the data we used for this experiment came from customers' production OpenShift deployments. Since we cannot open source that data (for obvious reasons), in this notebook we will restrict ourselves to the data collected from CI/CD deployments (which is already publicly accessible). Therefore, the results in this notebook might not accurately portray the effectiveness of this ML approach.

[28]
# imports
import copy
import datetime as dt

import numpy as np
import pandas as pd

from umap import UMAP
from sklearn.cluster import DBSCAN

import mlxtend.frequent_patterns

import plotly
import plotly.express as px
from plotly import graph_objects as go

import ipywidgets as widgets
from ipywidgets import interactive
from IPython.display import display, HTML
[2]
# set this to False to only view images inside the notebook
# set this to True to also save the images locally
SAVE_PLOTS = True
[3]
# "tag" to identify images produced during current run of the notbeook
IMAGES_TAG = str(int(dt.datetime.now().timestamp()))
[4]
# visualization utils
def create_hoverinfo(row):
    """
    Helper function for labelling data on plotly scatter3d
    For input one-hot encoded row of symptoms, returns a string that is the
    concatenation of the symptoms shown by this deployment, joined by "<br>"
    The "<br>" ensures that only one symptom is displayed per line in the hover info box in plotly
    """
    symptomlist = row[row != 0].index.tolist()
    return "<br>".join(sorted(symptomlist))


# color scheme
custom_colors = copy.deepcopy(px.colors.qualitative.Dark24)
primary = custom_colors[0]
to_remove = [primary] + [custom_colors[i] for i in [5, 7]]
for t in to_remove:
    custom_colors.remove(t)

Get Symptoms Data from CI/CD Deployments

In this section, we will fetch the symptoms dataset. This is just a table describing which deployments are showing which symptoms.

Internally this dataset is extracted from Kraken Reports, which provide a summary of health and usage status of customer deployments. Since the tools developed by CCX have not been open sourced yet, we won't be using them in this notebook. Instead, we will simply read the sample data that has already been extracted and provided in this repo.

[5]
# YYYYMMDD string that specifies the date from which we want data
DATE_PREFIX = "20201203"
[6]
# combine all reports to create clusters df
clusters_df = pd.read_parquet(
    f"../../data/processed/clusters_df_{DATE_PREFIX}.parquet"
)

print(clusters_df.shape)
clusters_df.head()
(565, 13)
cluster_id email_domain support managed initial_version current_version desired_version platform network_type install_type etc_objects_count anomaly_score current_version_maj_min
2 0020a49e-d3c1-4d30-890e-3c988f03d3cd redhat.com Eval False 4.7.0-0.ci.test-2020-12-02-234216-ci-op-m3tr842i 4.7.0-0.ci.test-2020-12-02-234216-ci-op-m3tr842i 4.7.0-0.ci.test-2020-12-02-234216-ci-op-m3tr842i AWS OpenShiftSDN IPI 9606.0 NaN 4.7
25 00e8050c-dc82-4a72-80c3-4b535243230f redhat.com Eval False 4.7.0-0.ci.test-2020-12-02-204557-ci-op-mxf8h1wz 4.7.0-0.ci.test-2020-12-02-204557-ci-op-mxf8h1wz 4.7.0-0.ci.test-2020-12-02-204557-ci-op-mxf8h1wz GCP OpenShiftSDN IPI 569.0 NaN 4.7
40 013895d7-4d77-4522-9d87-fe26a324fdfe redhat.com Eval False 4.7.0-0.ci.test-2020-12-02-091239-ci-op-k21s380c 4.7.0-0.ci.test-2020-12-02-091239-ci-op-k21s380c 4.7.0-0.ci.test-2020-12-02-091239-ci-op-k21s380c AWS OpenShiftSDN IPI 569.0 NaN 4.7
63 01dcac72-145a-4897-928c-dc694c63dbc6 redhat.com Eval False 4.7.0-0.nightly-2020-12-03-012053 4.7.0-0.nightly-2020-12-03-012053 4.7.0-0.nightly-2020-12-03-012053 GCP OpenShiftSDN IPI 8329.0 NaN 4.7
99 02e77350-8c29-474d-8515-feb81e6a6877 redhat.com Eval False 4.5.0-0.ci.test-2020-12-03-004835-ci-op-n1p0miw9 4.5.0-0.ci.test-2020-12-03-004835-ci-op-n1p0miw9 4.5.0-0.ci.test-2020-12-03-004835-ci-op-n1p0miw9 AWS OpenShiftSDN IPI 8968.0 NaN 4.5

NOTE When this notebook was run with production deployment data, we had several thousand clusters per day. Here, however, we have only ~550 CI clusters, so there is a lot less training data.

[7]
# combine symptoms data collected at all time intervals for the current date
clusters_symptoms_df = pd.read_parquet(
    f"../../data/processed/clusters_symptoms_df_{DATE_PREFIX}.parquet"
)

print(clusters_symptoms_df.shape)
clusters_symptoms_df.head()
(2314, 2)
cluster_id symptom_id
21 0020a49e-d3c1-4d30-890e-3c988f03d3cd alert|AlertmanagerReceiversNotConfigured
169 00e8050c-dc82-4a72-80c3-4b535243230f rule|operators_check|OPERATOR_ISSUE
170 00e8050c-dc82-4a72-80c3-4b535243230f alert|AlertmanagerReceiversNotConfigured
171 00e8050c-dc82-4a72-80c3-4b535243230f foc|monitoring|Degraded|UpdatingprometheusAdap...
172 00e8050c-dc82-4a72-80c3-4b535243230f rule|pods_check_containers|POD_CONTAINER_ISSUE

NOTE You may notice that some clusters are not included in this df. That is because there are no symptoms recorded for them.
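
For instance, one way to list the deployments with no recorded symptoms is a simple set difference (an illustrative snippet, not a cell from the original notebook):

no_symptom_ids = set(clusters_df["cluster_id"]) - set(clusters_symptoms_df["cluster_id"])
print(f"{len(no_symptom_ids)} deployments have no recorded symptoms")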

Apply Known Diagnoses

In this section, we will apply the known diagnoses to the symptoms data. That is, for each deployment, check whether the symptoms correspond to the definition of a known diagnosis. If they do, then label the deployment with that particular diagnosis.

By default, the diagnosis definitions used are the ones defined in the "Symptoms and Diagnoses Walkthrough" notebook on the kraken repo (see screenshots below). Again, since CCX tools have not been open sourced yet, we will use the labels for the sample data already extracted and provided in this repo.

NOTE The diagnosis labels will NOT be used to train the ML models. Rather, they are here for evaluation and validation purposes, i.e. to compare the output of the ML models against human-defined diagnoses.

[Screenshot] Figure: Definition of the "sdn-issue" diagnosis

[Screenshot] Figure: Definition of the "kubelet-down" diagnosis

[8]
# get known diagnosis labels
diag_names = [
    "ignored-symptoms",
    "sdn-issue",
    "kubelet-down",
    "BZ-1821905-DefaultSecurityContextConstraints_Mutated",
    "4.3-major-upgrade-autoscaler",
]
diagnoses_df = pd.read_parquet(
    f"../../data/processed/diagnoses_df_{DATE_PREFIX}.parquet"
)

print(diagnoses_df.shape)
diagnoses_df.head()
(1527, 2)
cluster_id symptom_id
18 0020a49e-d3c1-4d30-890e-3c988f03d3cd alert|AlertmanagerReceiversNotConfigured
138 00e8050c-dc82-4a72-80c3-4b535243230f rule|nodes_requirements_check|NODES_MINIMUM_RE...
139 00e8050c-dc82-4a72-80c3-4b535243230f foc|monitoring|Degraded|UpdatingprometheusAdap...
140 00e8050c-dc82-4a72-80c3-4b535243230f foc|version|Failing|ClusterOperatorDegraded|8c59
141 00e8050c-dc82-4a72-80c3-4b535243230f foc|monitoring|Progressing|RollOutInProgress|f311
[9]
# how many ci/cd clusters matched the signature/definition of known diagnoses
diagnoses_df[diagnoses_df["symptom_id"].str.startswith("diagnosis")]
cluster_id symptom_id
619 034265cb-0256-4a34-8975-fad19d73cf66 diagnosis|sdn-issue
34482 aacf11ec-249d-4837-a9ac-749913fc4943 diagnosis|sdn-issue
34483 aacf11ec-249d-4837-a9ac-749913fc4943 diagnosis|kubelet-down
362 01dcac72-145a-4897-928c-dc694c63dbc6 diagnosis|sdn-issue
363 01dcac72-145a-4897-928c-dc694c63dbc6 diagnosis|kubelet-down
4120 15d71e6b-ccd9-4b3c-86c9-3f4e36d14499 diagnosis|sdn-issue

NOTE There are only 4 deployments with known diagnoses. So in addition to not having a ton of training data, we also don't have many labels.
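
As a quick sanity check (an illustrative snippet, not a cell from the original notebook), the number of distinct diagnosed deployments can be counted directly:

diagnosed = diagnoses_df[diagnoses_df["symptom_id"].str.startswith("diagnosis")]
print(diagnosed["cluster_id"].nunique())  # 4 distinct deployments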

Create Training Dataset

In this section, we slice-and-dice the dataset. In the widgets output by the cell below,

1. Use the first selection box to select the "major.minor" OpenShift version (e.g. 4.3). This will keep only the deployments from that major.minor version in the dataset.

2. Within a major.minor version, you can decide to keep only the deployments from specific patch versions (e.g. 4.3.2 and 4.3.3). Use the second selection box to select the patch versions (by default, all will be included). Use Ctrl + click or Shift + click to select multiple.

3. Optionally, if you wish to confine the dataset to deployments of a specific customer, use the drop down to select that customer's email domain.

After slicing and dicing, we will change the format in which the data is represented, from a "string" representation to a "one-hot encoded" numeric representation. The result will be the final dataset that will be preprocessed and fed to the ML models.

[10]
# mapping from maj.min to all maj.min.patch
patches_per_majmin = clusters_df.groupby("current_version_maj_min").apply(
    lambda g: sorted(g["current_version"].unique())
)
patches_per_majmin["ALL"] = sorted(clusters_df["current_version"].unique())

# how many depls of each maj.min
depls_per_majmin = clusters_df[
    clusters_df.cluster_id.isin(clusters_symptoms_df.cluster_id)
]["current_version_maj_min"].value_counts()
depls_per_majmin["ALL"] = depls_per_majmin.sum()

# add number of depls info to maj.min names, sort by num depls
patches_per_majmin = patches_per_majmin.sort_index()
new_index = [
    "{0} ({1:4d} depls)".format(val, ct)
    for val, ct in depls_per_majmin.items()
]
patches_per_majmin.index = sorted(new_index)


# func needed to ensure both select boxes are displayed
def do_nothing(dummy_input):
    return


# func to update patches shown based on maj.min selected
def select_patch_version(maj_min_version):
    maj_min_patch_widget.options = patches_per_majmin[maj_min_version]
    maj_min_patch_widget.value = patches_per_majmin[maj_min_version]


# widget to select maj.min, showing in descending order of # depls
maj_min_widget_options = sorted(
    patches_per_majmin.index,
    key=lambda x: int(x.rsplit(" ", 1)[0].rsplit("(")[-1]),
    reverse=True,
)
maj_min_widget = widgets.Select(
    options=maj_min_widget_options,
    value=maj_min_widget_options[0],
    description="Major.Minor",
)

# widget to select maj.min.patch, all selected by default
maj_min_patch_widget = widgets.SelectMultiple(
    options=patches_per_majmin[maj_min_widget.value],
    value=patches_per_majmin[maj_min_widget.value],
    description="Patches",
)

# emails + count (how many depls of that email domain)
email_vc = clusters_df[
    clusters_df.cluster_id.isin(clusters_symptoms_df.cluster_id)
]["email_domain"].value_counts()
emails_with_ct = []
for i, v in email_vc.items():
    emails_with_ct.append(f"{i}  ({v} depls)")

# widget to select email domain, all selected by default
email_widget = widgets.Dropdown(
    options=[f"ALL ({(email_vc.sum())})"] + emails_with_ct,
    description="Email Domain",
)

# display all widgets
display(interactive(select_patch_version, maj_min_version=maj_min_widget))
display(interactive(do_nothing, dummy_input=maj_min_patch_widget))
display(interactive(do_nothing, dummy_input=email_widget))
interactive(children=(Select(description='Major.Minor', options=('ALL ( 565 depls)', '4.7 ( 385 depls)', '4.6 …
interactive(children=(SelectMultiple(description='Patches', index=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
interactive(children=(Dropdown(description='Email Domain', options=('ALL (565)', 'redhat.com  (565 depls)'), v…

In the next cell, we reformat the data as follows: instead of having one row per (deployment, symptom) pair, we'll have one row per deployment and one column per possible symptom. A value of "0" in a column indicates that that deployment (row) does not show that symptom (column) and a value of "1" indicates that it does.
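
To illustrate the reshaping on a toy example (made-up IDs, not data from this notebook):

toy = pd.DataFrame(
    {
        "cluster_id": ["a", "a", "b"],
        "symptom_id": ["alert|X", "rule|Y", "alert|X"],
    }
)
onehot = (
    toy.assign(value=1)
    .pivot_table(index="cluster_id", columns="symptom_id", values="value")
    .fillna(0)
)
# onehot:
# symptom_id  alert|X  rule|Y
# cluster_id
# a               1.0     1.0
# b               1.0     0.0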

[11]
# filter data and pivot so that each row is a deployment

# keep only deployments of selected version
X_df = clusters_symptoms_df[
    clusters_symptoms_df.cluster_id.isin(
        clusters_df["cluster_id"][
            clusters_df["current_version"].isin(maj_min_patch_widget.value)
        ]
    )
]

# keep only deployments of selected email
EMAIL = email_widget.value.split()[0]
if EMAIL.lower() != "all":
    X_df = X_df[
        X_df.cluster_id.isin(
            clusters_df[clusters_df["email_domain"] == EMAIL].cluster_id
        )
    ]

# drop nan rows, if any (ideally they shouldn't exist)
nan_count = X_df.isna().sum().sum()
if nan_count != 0:
    print(f"Found {nan_count} nans in filtered X_df")
    X_df = X_df.dropna()

# ================================ checks for pivot ================================ #
# def check_pivot(X_df, X_df_new):
#     for n,g in X_df.groupby('cluster_id'):
#         res = X_df_new.loc[n]
#         if not res[res==1].index.difference(g['symptom_id'].values).empty:
#             pdb.set_trace()
#             return False
#     return True
# pre = X_df['cluster_id'].nunique()
# ================================================================================== #

# pivot so that each row represents a cluster id, and the columns are the symptoms
# NOTE: we need a numerical value column for pivot to work. assign value=1 dummy col
X_df = X_df.assign(value=1).pivot_table(
    index="cluster_id",
    columns=X_df.columns.drop("cluster_id").tolist(),
    values="value",
)

# ================================ checks for pivot ================================ #
# post = len(X_df_new)
# assert pre==post
# assert check_pivot(X_df, X_df_new)
# ================================================================================== #

X_df = X_df.fillna(value=0)

# # add UPI/IPI info
# # NOTE: if there are nans for install_type for some depl ids in clusters_df,
# # those depl ids would still show up here, but with 0's in is_UPI and is_IPI
# X_df = X_df.assign(
#     is_UPI=X_df.index.isin(
#         clusters_df["cluster_id"][
#             clusters_df["install_type"] == "UPI"
#         ].unique()
#     ).astype(np.float64)
# )
# X_df = X_df.assign(
#     is_IPI=X_df.index.isin(
#         clusters_df["cluster_id"][
#             clusters_df["install_type"] == "IPI"
#         ].unique()
#     ).astype(np.float64)
# )

print(
    "The following is a glimpse of the sliced/diced and formatted dataset that will be fed to ML models:"
)
print(f"Shape = {X_df.shape}")
X_df.head()
The following is a glimpse of the sliced/diced and formatted dataset that will be fed to ML models:
Shape = (493, 162)
symptom_id alert|AlertmanagerReceiversNotConfigured alert|CloudCredentialOperatorDown alert|ClusterAutoscalerOperatorDown alert|ClusterNotUpgradeable alert|FluentdNodeDown alert|KubeAPIDown alert|KubeAPIErrorBudgetBurn alert|KubeAPIErrorsHigh alert|KubeClientErrors alert|KubeControllerManagerDown ... rule|nodes_pressure_check|NODE_PRESSURE rule|nodes_requirements_check|NODES_MINIMUM_REQUIREMENTS_NOT_MET rule|ocp_version_end_of_life|OCP4X_BEYOND_EOL rule|operators_check|OPERATOR_ISSUE rule|pods_check_containers|POD_CONTAINER_ISSUE rule|pods_check|POD_ISSUE rule|pods_crash_loop_check|POD_CRASHLOOP_ISSUE rule|version_check|CLUSTER_VERSION_MISMATCH rule|version_forced|FORCED_VERSION_UPDATES rule|version_retarget|ABORTED_UPDATES_IN_RECENT_HISTORY
cluster_id
000167dc-b92b-4677-bcb9-28cf8b2eded3 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0020a49e-d3c1-4d30-890e-3c988f03d3cd 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
00e8050c-dc82-4a72-80c3-4b535243230f 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0
00ecb624-57ad-4f9d-8580-b6e1373c5a45 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0
013895d7-4d77-4522-9d87-fe26a324fdfe 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 162 columns

Preprocess Data

The raw one-hot encoded dataframe of symptoms has a lot of columns, i.e. it is very high dimensional. Many clustering algorithms perform suboptimally on high dimensional data (the curse of dimensionality). Therefore, in this section we will perform dimensionality reduction.

First, we will manually remove some features that are not as relevant for diagnosis discovery as others. Various approaches for feature selection were explored in a separate notebook. The best results (with customer data, not CI/CD data) were seen when we dropped the features that subject matter experts did not consider very informative.

Next, we'll use the UMAP algorithm to create a 3-d representation of our data. Essentially, UMAP tries to create a low dimension representation in such a way that the "patterns" and "relationships" that exist in the data are preserved as much as possible. In other words, it should place deployments with similar symptoms close to one another, and those with different symptoms away from one another.

NOTE: It's possible that there are better techniques for dimension reduction; UMAP here just serves as an example or a baseline. Exploring other techniques is out of the scope of this notebook (a separate notebook covers that).
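
For illustration only, any reducer exposing the sklearn fit_transform interface could be swapped in here; e.g., a (hypothetical) PCA baseline would be a two-liner:

from sklearn.decomposition import PCA
X_df_trans_pca = PCA(n_components=3, random_state=42).fit_transform(X_df)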

[12]
# drop columns ccx believes are not very informative
ignored_symptoms = [
    c
    for c in X_df.columns
    if any(i in c for i in ["ClusterOperatorDown", "ClusterOperatorDegraded"])
]
ignored_symptoms += [
    "rule|operators_check|OPERATOR_ISSUE",
    "rule|pods_check|POD_ISSUE",
    "rule|pods_check_containers|POD_CONTAINER_ISSUE",
]
ignored_symptoms += ["alert|AlertmanagerReceiversNotConfigured"]

print("Dropping symptoms:\n")
for i in ignored_symptoms:
    print(i)
X_df = X_df.drop(ignored_symptoms, axis=1)

# drop rows that are all 0s since they contain no useful data
X_df = X_df[X_df.sum(axis=1) != 0]
X_df.shape
Dropping symptoms:

alert|openshift-cluster-version|ClusterOperatorDegraded
alert|openshift-cluster-version|ClusterOperatorDown
foc|version|Failing|ClusterOperatorDegraded|007c
foc|version|Failing|ClusterOperatorDegraded|8c59
foc|version|Progressing|ClusterOperatorDegraded|4696
rule|operators_check|OPERATOR_ISSUE
rule|pods_check|POD_ISSUE
rule|pods_check_containers|POD_CONTAINER_ISSUE
alert|AlertmanagerReceiversNotConfigured
(382, 153)
[13]
# define umap instance
umap_instance = UMAP(
    n_components=3,
    metric="hamming",
    n_neighbors=64,
    min_dist=0.005,
    random_state=42,
)

# fit umap to our data and transform (reduce dimensions) the data using it
X_df_trans = umap_instance.fit_transform(X_df)
/opt/app-root/lib/python3.6/site-packages/umap/umap_.py:1530: UserWarning: gradient function is not yet implemented for hamming distance metric; inverse_transform will be unavailable

Visualize

Visualizing the data will help us see any patterns or properties in it. It could also give us a rough idea (based on the structure/layout of our dimension-reduced data) of how well any clustering algorithm could possibly perform.

[14]
# visualize deployments and known diagnoses
fig = go.Figure()

# data to display when hover over - list of all symptoms
hoverdata = X_df.apply(create_hoverinfo, axis=1)

# add all deployments
fig.add_trace(
    go.Scatter3d(
        name="all_symptoms",
        mode="markers",
        x=X_df_trans[:, 0],
        y=X_df_trans[:, 1],
        z=X_df_trans[:, 2],
        hovertemplate="<b>_id</b>: %{customdata}<br>" + "<br>%{text}",
        customdata=hoverdata.index.tolist(),
        text=hoverdata.values,
        marker=dict(size=2, color=primary),
    )
)


for di, diag_name in enumerate(diag_names):
    # get deployments hit with current diagnosis
    labels = X_df.index.isin(
        diagnoses_df[diagnoses_df["symptom_id"] == f"diagnosis|{diag_name}"][
            "cluster_id"
        ].unique()
    )

    # add data points that have this diagnosis
    fig.add_trace(
        go.Scatter3d(
            name=diag_name,
            mode="markers",
            x=X_df_trans[labels, 0],
            y=X_df_trans[labels, 1],
            z=X_df_trans[labels, 2],
            hovertemplate="<b>_id</b>: %{customdata}<br>" + "<br>%{text}",
            customdata=hoverdata[labels].index.tolist(),
            text=hoverdata[labels].values,
            marker=dict(size=3, color=custom_colors[di]),
        )
    )

if SAVE_PLOTS:
    savedir = "../../reports/figures"
    fname = f"symptoms_{umap_instance.metric}_all_diagnoses"
    suffix = f"{DATE_PREFIX.replace('/', '_')}_{IMAGES_TAG}"

    print(f"Saving plot as '{savedir}_{fname}_{suffix}.html'")
    plotly.offline.plot(
        fig,
        filename=f"{savedir}/{fname}_{suffix}.html",
        auto_open=True,
    )
fig.show()
Saving plot as '../../reports/figures/symptoms_hamming_all_diagnoses_20201203_1611765763.html'

Interpretation

  • This graph shows the 3d representation of the dataset that UMAP created. Each point on the graph represents a deployment.
  • Hovering over a point (deployment) will show all the symptoms shown by that deployment.
  • The points on the graph are colored by known diagnoses, as mapped in the legend. Here, for example, all the points of one color represent deployments that kraken diagnosed with "kubelet-down", while those of another color represent deployments diagnosed with "sdn-issue".

Apply Clustering

Now, we will use DBSCAN to cluster the dimension-reduced data. NOTE: DBSCAN may not be the final algorithm that we use for clustering, but it serves as a good starting point.
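
For reference, a toy sketch (made-up points, not our data) of DBSCAN's semantics: points with at least min_samples neighbors within distance eps form clusters, and everything else is labelled -1 ("noise"). min_samples is left at sklearn's default of 5 in the cell below.

from sklearn.cluster import DBSCAN
import numpy as np

toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
labels = DBSCAN(eps=0.5, min_samples=3).fit(toy).labels_
print(labels)  # [ 0  0  0 -1] -- the three dense points form cluster 0, the far point is noise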

[15]
# fit a vanilla dbscan model
dbscan = DBSCAN(
    eps=0.325,
    n_jobs=-1,
)
dbscan.fit(X_df_trans)
DBSCAN(eps=0.325, n_jobs=-1)
[16]
# unique labels (clusters of deployments) found by dbscan
unique_dbscan_labels = np.unique(dbscan.labels_)
[17]
# visualize deployments, colored by dbscan cluster label
fig = go.Figure()

# data to display when hover over - list of all symptoms
hoverdata = X_df.apply(create_hoverinfo, axis=1)

for c in unique_dbscan_labels:
    # index where points are of this class
    c_pts = dbscan.labels_ == c

    # add those points
    fig.add_trace(
        go.Scatter3d(
            x=X_df_trans[c_pts, 0],
            y=X_df_trans[c_pts, 1],
            z=X_df_trans[c_pts, 2],
            name=f"Cluster ID {c}",
            mode="markers",
            hovertemplate="<b>_id</b>: %{customdata}<br>" + "<br>%{text}",
            customdata=hoverdata[c_pts].index.tolist(),
            text=hoverdata[c_pts].values,
            marker=dict(size=2),
        )
    )

if SAVE_PLOTS:
    savedir = "../../reports/figures"
    fname = f"dbscan_{dbscan.eps}_{dbscan.min_samples}"
    suffix = f"{DATE_PREFIX.replace('/', '_')}_{IMAGES_TAG}"

    print(f"Saving plot as '{savedir}/{fname}_{suffix}.html'")
    plotly.offline.plot(
        fig,
        filename=f"{savedir}/{fname}_{suffix}.html",
        auto_open=True,
    )
fig.show()
Saving plot as '../../reports/figures/dbscan_0.325_5_20201203_1611765763.html'

Interpretation

  • This graph is interpreted in a similar way to the one above. The only difference is how the points are colored.
  • The color of a point represents the cluster (group) that the clustering algorithm assigned it to. For example, all the yellow points represent deployments that were labelled as belonging to cluster id 8. The label -1 is DBSCAN's designation for "noise" points that were not assigned to any cluster.

INTERESTING NOTE 1 In the above graph, it can be seen that the most popular symptom among deployments in cluster id 2 is POD_CRASHLOOP_ISSUE, and that for deployments in the neighboring cluster id 8 it is NODES_MINIMUM_REQUIREMENTS_NOT_MET. The deployments that lie at the boundary of these two clusters tend to have both of these symptoms. This suggests that the latent representation learned and the clusters formed are somewhat meaningful.

INTERESTING NOTE 2 In the above graph, it can be seen that the most popular symptom among deployments in cluster id 4 is FORCED_VERSION_UPDATES, and that for deployments in the nearby cluster id 8 it is NODES_MINIMUM_REQUIREMENTS_NOT_MET. For the deployments in cluster id 11, which lies between these two clusters, the most popular symptoms are both FORCED_VERSION_UPDATES and NODES_MINIMUM_REQUIREMENTS_NOT_MET. This again suggests that the latent representation learned and the clusters formed are somewhat meaningful.

[18]
# for each cluster, what fraction of deployments have a particular diagnosis?
dbclust_diagnosis_df = pd.DataFrame(
    index=unique_dbscan_labels,
    columns=["cluster_size"] + [f"percent_{i}" for i in diag_names],
)
dbclust_diagnosis_df.index.rename("cluster_id", inplace=True)
for c in unique_dbscan_labels:
    clust_depls = X_df.index[dbscan.labels_ == c]
    dbclust_diagnosis_df.loc[c, "cluster_size"] = len(clust_depls)
    for diag in diag_names:
        does_clust_have_diag = clust_depls.isin(
            diagnoses_df[diagnoses_df["symptom_id"] == f"diagnosis|{diag}"][
                "cluster_id"
            ].unique()
        )
        # fraction of depls in this cluster carrying the diagnosis
        # (mean of the boolean membership mask)
        dbclust_diagnosis_df.loc[
            c, f"percent_{diag}"
        ] = does_clust_have_diag.mean()

pct_sorted_idx = (
    dbclust_diagnosis_df.drop("cluster_size", axis=1)
    .max(axis=1)
    .sort_values(ascending=False)
    .index
)

print(
    "The following table shows what percent of deployments in a given cluster (group) had a particular diagnosis"
)
dbclust_diagnosis_df.reindex(pct_sorted_idx).head(10)
The following table shows what percent of deployments in a given cluster (group) had a particular diagnosis
cluster_size percent_ignored-symptoms percent_sdn-issue percent_kubelet-down percent_BZ-1821905-DefaultSecurityContextConstraints_Mutated percent_4.3-major-upgrade-autoscaler
cluster_id
10 6 0 0.166667 0.166667 0 0
-1 23 0 0.0869565 0.0434783 0 0
9 12 0 0.0833333 0 0 0
12 6 0 0 0 0 0
11 19 0 0 0 0 0
8 72 0 0 0 0 0
7 23 0 0 0 0 0
6 41 0 0 0 0 0
5 11 0 0 0 0 0
4 20 0 0 0 0 0

Most Frequently Co-occurring Symptoms

Now that clustering is done, we have a rough idea of which deployments are similar to each other (and therefore belong in the same cluster). In order to get the potential diagnosis that the deployments in a cluster supposedly share, we want to determine what makes the deployments in that cluster different from the rest. That is, which symptoms are the defining characteristics of the cluster.

To do this, we'll first use "frequent pattern mining" algorithms to find which symptom combinations are the most "dominant" in each cluster. Then, we'll compare the frequent symptom combinations in a cluster with those in others to determine which symptom combinations are unique to that cluster.
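
As a toy illustration (made-up symptom names, not data from this notebook) of what mlxtend's frequent pattern miners return:

from mlxtend.frequent_patterns import fpgrowth

toy = pd.DataFrame(
    [[True, True, False], [True, True, True], [True, False, False]],
    columns=["symA", "symB", "symC"],
)
# itemsets present in at least 60% of the rows:
# (symA) with support 1.0, and (symB) and (symA, symB) with support ~0.67
fpgrowth(toy, min_support=0.6, use_colnames=True)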

[19]
def get_frequent_symptom_combinations(
    symptom_onehot_df, algo="fpgrowth", min_support=0.9, drop_singles=False
):
    # calculate frequent patterns as per the algorithm name passed
    ret = getattr(mlxtend.frequent_patterns, algo)(
        df=symptom_onehot_df, min_support=min_support, use_colnames=True
    )

    # keep only those "combinations" whose length is >1
    if drop_singles:
        ret = ret[ret["itemsets"].apply(lambda x: len(x)) > 1]

    # friendlier index and column names
    ret = ret.rename(
        columns={
            "itemsets": "symptom_combination",
            "support": "percent_affected",
        }
    ).set_index("symptom_combination")

    # sort by percent_affected to bring attention to more frequent symptoms
    ret = ret.sort_values(by="percent_affected", ascending=False)

    return ret

Overall Data

Before diving into frequent symptom combinations in specific clusters, let's have a look at symptom combinations in the overall data, so that we have a baseline to compare against.

NOTE: This might not be super informative because overall, various deployments are affected by various issues, so the values for "average" issues and patterns are likely very small. Nonetheless, this is meant to give only a rough idea, and to see if there are any symptoms that are common across a significant portion of the fleet.

[20]
# average value of each symptom in overall data
X_df.mean().sort_values(ascending=False).to_frame("percent_affected").head(10)
percent_affected
symptom_id
rule|nodes_requirements_check|NODES_MINIMUM_REQUIREMENTS_NOT_MET 0.379581
rule|pods_crash_loop_check|POD_CRASHLOOP_ISSUE 0.225131
rule|version_forced|FORCED_VERSION_UPDATES 0.225131
alert|openshift-monitoring|KubePodNotReady 0.157068
alert|openshift-monitoring|TargetDown|cluster-monitoring-operator 0.115183
rule|machineconfig_stuck_by_node_taints|NODE_HAS_TAINTS_APPLIED 0.115183
alert|openshift-monitoring|KubeDeploymentReplicasMismatch 0.107330
rule|version_check|CLUSTER_VERSION_MISMATCH 0.091623
alert|openshift-cloud-credential-operator|CloudCredentialOperatorProvisioningFailed 0.073298
alert|ThanosQueryHighDNSFailures 0.065445
[21]
# find frequent combinations in the entire dataset
# NOTE: min_support = at least this fraction of depls must show this combination of symptoms
get_frequent_symptom_combinations(
    X_df, min_support=0.1, drop_singles=True
).head(10)
percent_affected
symptom_combination
(alert|openshift-monitoring|TargetDown|cluster-monitoring-operator, alert|openshift-monitoring|KubePodNotReady) 0.10733
(alert|openshift-monitoring|KubeDeploymentReplicasMismatch, alert|openshift-monitoring|TargetDown|cluster-monitoring-operator) 0.10733
(alert|openshift-monitoring|KubeDeploymentReplicasMismatch, alert|openshift-monitoring|KubePodNotReady) 0.10733
(alert|openshift-monitoring|KubeDeploymentReplicasMismatch, alert|openshift-monitoring|TargetDown|cluster-monitoring-operator, alert|openshift-monitoring|KubePodNotReady) 0.10733

Specific Cluster (diagnosis = "sdn-issue")

Let's perform this analysis for a specific cluster, cluster id 9. From the table at the end of the Apply Clustering section, we see that one of the deployments in this cluster was diagnosed with "sdn-issue". If our analysis is able to hint at the symptoms that define "sdn-issue", then this line of research is worth exploring further.

As per kraken, the "sdn-issue" diagnosis is defined by:

1. Root Cause: AlertSymptom with "namespace": "openshift-sdn", "name": "KubeDaemonSetRolloutStuck"

2. Consequence: OperatorConditionSymptom with "operator": "dns", OR AlertSymptom with "namespace": "openshift-dns"
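
A minimal sketch of how such a definition could be checked against our one-hot symptoms frame, assuming the symptom_id naming patterns seen above (this is not the actual kraken implementation):

# hypothetical column names inferred from the symptom_id format
root_cause = "alert|openshift-sdn|KubeDaemonSetRolloutStuck"
consequence_cols = [
    c
    for c in X_df.columns
    if c.startswith("foc|dns|") or c.startswith("alert|openshift-dns|")
]
has_root = X_df.get(root_cause, pd.Series(0.0, index=X_df.index)) == 1
has_consequence = (
    X_df[consequence_cols].sum(axis=1) > 0
    if consequence_cols
    else pd.Series(False, index=X_df.index)
)
matches_sdn_issue = has_root & has_consequence

Deployments where matches_sdn_issue is True would be candidates for the "diagnosis|sdn-issue" label.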

[22]
# specific cluster for which to extract frequent patterns
clustid = 9

# indexer into X_df. True where the deployment belongs to the cluster with id == clustid
is_depl_in_clust = dbscan.labels_ == clustid

# how many deployments in this particular cluster
display(
    HTML(
        f"<h3> Num Deployments in this Cluster = {is_depl_in_clust.sum()}</h3>"
    )
)
display(
    HTML(
        f"<h3> Percent Deployments Assigned to this Cluster = {is_depl_in_clust.mean()}</h3>"
    )
)

Num Deployments in this Cluster = 12

Percent Deployments Assigned to this Cluster = 0.031413612565445025

[23]
# "average" symptom vector for this cluster
curr_clust_mean_symptom = (
    X_df[is_depl_in_clust].mean().to_frame("percent_affected")
)
display(HTML("<h3>Most Affecting Symptoms</h3>"))
display(
    curr_clust_mean_symptom.sort_values(
        "percent_affected", ascending=False
    ).head(10)
)

# "average" symptom vector for depls not in this cluster
other_clusts_mean_symptom = (
    X_df[~is_depl_in_clust].mean().to_frame("percent_affected")
)

# what symptoms have different occurrence frequencies as compared to other clusters
diff = pd.merge(
    curr_clust_mean_symptom,
    other_clusts_mean_symptom,
    how="left",
    left_index=True,
    right_index=True,
    suffixes=("_this", "_others"),
)

# display in descending order of difference
display(HTML("<h3>Most <i>UNIQUELY</i> Affecting Symptoms</h3>"))
diff.reindex(
    (diff["percent_affected_this"] - diff["percent_affected_others"])
    .sort_values(ascending=False)
    .index
).head(10)

Most Affecting Symptoms

percent_affected
symptom_id
rule|version_forced|FORCED_VERSION_UPDATES 1.000000
alert|openshift-cluster-version|CannotRetrieveUpdates 1.000000
alert|openshift-cloud-credential-operator|CloudCredentialOperatorProvisioningFailed 0.916667
rule|version_check|CLUSTER_VERSION_MISMATCH 0.333333
rule|pods_crash_loop_check|POD_CRASHLOOP_ISSUE 0.250000
alert|default|KubeClientCertificateExpiration 0.250000
alert|openshift-multus|KubeDaemonSetRolloutStuck 0.250000
alert|ThanosQueryGrpcClientErrorRate 0.250000
alert|ThanosQueryHighDNSFailures 0.250000
alert|openshift-image-registry|KubeJobCompletion 0.166667

Most UNIQUELY Affecting Symptoms

percent_affected_this percent_affected_others
symptom_id
alert|openshift-cluster-version|CannotRetrieveUpdates 1.000000 0.021622
alert|openshift-cloud-credential-operator|CloudCredentialOperatorProvisioningFailed 0.916667 0.045946
rule|version_forced|FORCED_VERSION_UPDATES 1.000000 0.200000
rule|version_check|CLUSTER_VERSION_MISMATCH 0.333333 0.083784
alert|default|KubeClientCertificateExpiration 0.250000 0.005405
alert|ThanosQueryGrpcClientErrorRate 0.250000 0.010811
alert|openshift-multus|KubeDaemonSetRolloutStuck 0.250000 0.010811
alert|ThanosQueryHighDNSFailures 0.250000 0.059459
alert|openshift-image-registry|KubeJobCompletion 0.166667 0.000000
alert|openshift-image-registry|KubeContainerWaiting 0.166667 0.002703
[24]
# frequent symptom combinations in depls of this cluster
curr_clust_combinations = get_frequent_symptom_combinations(
    X_df[is_depl_in_clust], min_support=0.25, drop_singles=False
)
display(HTML("<h3>Most Affecting Symptom Combinations</h3>"))
display(curr_clust_combinations.head(10))

# frequent symptom patterns in depls of all other clusters
# NOTE: since the total num of depls is high, and it can contain depls of different types, min_support is kept very low (~1%)
other_clusts_combinations = get_frequent_symptom_combinations(
    X_df[~is_depl_in_clust], min_support=0.01, drop_singles=False
)

# what symptom patterns have different frequencies in this cluster as compared to other clusters
diff = pd.merge(
    curr_clust_combinations,
    other_clusts_combinations,
    how="left",
    left_index=True,
    right_index=True,
    suffixes=("_this", "_others"),
).fillna(0.05)  # combinations absent from other clusters get a small placeholder support

# display in descending order of difference
display(HTML("<h3>Most <i>UNIQUELY</i> Affecting Symptom Combinations</h3>"))
diff.reindex(
    (diff["percent_affected_this"] - diff["percent_affected_others"])
    .sort_values(ascending=False)
    .index
).head(10)

Most Affecting Symptom Combinations

percent_affected
symptom_combination
(rule|version_forced|FORCED_VERSION_UPDATES) 1.000000
(alert|openshift-cluster-version|CannotRetrieveUpdates) 1.000000
(alert|openshift-cluster-version|CannotRetrieveUpdates, rule|version_forced|FORCED_VERSION_UPDATES) 1.000000
(alert|openshift-cluster-version|CannotRetrieveUpdates, rule|version_forced|FORCED_VERSION_UPDATES, alert|openshift-cloud-credential-operator|CloudCredentialOperatorProvisioningFailed) 0.916667
(rule|version_forced|FORCED_VERSION_UPDATES, alert|openshift-cloud-credential-operator|CloudCredentialOperatorProvisioningFailed) 0.916667
(alert|openshift-cluster-version|CannotRetrieveUpdates, alert|openshift-cloud-credential-operator|CloudCredentialOperatorProvisioningFailed) 0.916667
(alert|openshift-cloud-credential-operator|CloudCredentialOperatorProvisioningFailed) 0.916667
(rule|version_check|CLUSTER_VERSION_MISMATCH) 0.333333
(alert|openshift-cluster-version|CannotRetrieveUpdates, rule|version_check|CLUSTER_VERSION_MISMATCH) 0.333333
(rule|version_check|CLUSTER_VERSION_MISMATCH, rule|version_forced|FORCED_VERSION_UPDATES) 0.333333

Most UNIQUELY Affecting Symptom Combinations

percent_affected_this percent_affected_others
symptom_combination
(alert|openshift-cluster-version|CannotRetrieveUpdates) 1.000000 0.021622
(alert|openshift-cluster-version|CannotRetrieveUpdates, rule|version_forced|FORCED_VERSION_UPDATES) 1.000000 0.050000
(alert|openshift-cloud-credential-operator|CloudCredentialOperatorProvisioningFailed) 0.916667 0.045946
(alert|openshift-cluster-version|CannotRetrieveUpdates, rule|version_forced|FORCED_VERSION_UPDATES, alert|openshift-cloud-credential-operator|CloudCredentialOperatorProvisioningFailed) 0.916667 0.050000
(rule|version_forced|FORCED_VERSION_UPDATES, alert|openshift-cloud-credential-operator|CloudCredentialOperatorProvisioningFailed) 0.916667 0.050000
(alert|openshift-cluster-version|CannotRetrieveUpdates, alert|openshift-cloud-credential-operator|CloudCredentialOperatorProvisioningFailed) 0.916667 0.050000
(rule|version_forced|FORCED_VERSION_UPDATES) 1.000000 0.200000
(alert|openshift-cluster-version|CannotRetrieveUpdates, rule|version_check|CLUSTER_VERSION_MISMATCH) 0.333333 0.050000
(alert|openshift-cluster-version|CannotRetrieveUpdates, rule|version_check|CLUSTER_VERSION_MISMATCH, rule|version_forced|FORCED_VERSION_UPDATES) 0.333333 0.050000
(rule|version_check|CLUSTER_VERSION_MISMATCH, rule|version_forced|FORCED_VERSION_UPDATES) 0.333333 0.083784

Interpretation

Symptoms such as alert|openshift-multus|KubeDaemonSetRolloutStuck, alert|ThanosQueryHighDNSFailures, and alert|ThanosQueryGrpcClientErrorRate surfacing in the above tables suggest that a network issue is the underlying problem for the deployments in this cluster. This is somewhat (but not completely) consistent with our knowledge, as one of the deployments in this cluster had already been diagnosed with sdn-issue.

Example Potential Diagnosis

In this section, we'll apply the above analysis to deployments of a specific cluster, for which we don't have a defined diagnosis yet. The goal is to see if we can extract useful information that could hint towards a potential diagnosis for deployments of this cluster.

[25]
# specific cluster for which to extract frequent patterns
clustid = 0

# indexer into X_df. True where the deployment belongs to the cluster with id == clustid
is_depl_in_clust = dbscan.labels_ == clustid

# how many deployments in this particular cluster
display(
    HTML(
        f"<h3> Num Deployments in this Cluster = {is_depl_in_clust.sum()}</h3>"
    )
)
display(
    HTML(
        f"<h3> Percent Deployments Assigned to this Cluster = {is_depl_in_clust.mean()}</h3>"
    )
)

Num Deployments in this Cluster = 17

Percent Deployments Assigned to this Cluster = 0.04450261780104712

[26]
# "average" symptom vector for this cluster
curr_clust_mean_symptom = (
    X_df[is_depl_in_clust].mean().to_frame("percent_affected")
)
display(HTML("<h3>Most Affecting Symptoms</h3>"))
display(
    curr_clust_mean_symptom.sort_values(
        "percent_affected", ascending=False
    ).head(10)
)

# "average" symptom vector for depls not in this cluster
other_clusts_mean_symptom = (
    X_df[~is_depl_in_clust].mean().to_frame("percent_affected")
)

# what symptoms have different occurrence frequencies as compared to other clusters
diff = pd.merge(
    curr_clust_mean_symptom,
    other_clusts_mean_symptom,
    how="left",
    left_index=True,
    right_index=True,
    suffixes=("_this", "_others"),
)

# display in descending order of difference
display(HTML("<h3>Most <i>UNIQUELY</i> Affecting Symptoms</h3>"))
diff.reindex(
    (diff["percent_affected_this"] - diff["percent_affected_others"])
    .sort_values(ascending=False)
    .index
).head(10)

Most Affecting Symptoms

percent_affected
symptom_id
alert|openshift-monitoring|KubePodNotReady 1.000000
alert|openshift-monitoring|TargetDown|prometheus-adapter 1.000000
alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors 1.000000
foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc 0.941176
alert|openshift-monitoring|KubeContainerWaiting 0.882353
alert|default|AggregatedAPIDown 0.882353
alert|KubeClientErrors 0.764706
foc|monitoring|Progressing|RollOutInProgress|f311 0.529412
alert|openshift-cloud-credential-operator|CloudCredentialOperatorProvisioningFailed 0.411765
rule|nodes_requirements_check|NODES_MINIMUM_REQUIREMENTS_NOT_MET 0.411765

Most UNIQUELY Affecting Symptoms

percent_affected_this percent_affected_others
symptom_id
alert|openshift-monitoring|TargetDown|prometheus-adapter 1.000000 0.005479
alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors 1.000000 0.010959
foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc 0.941176 0.005479
alert|openshift-monitoring|KubePodNotReady 1.000000 0.117808
alert|default|AggregatedAPIDown 0.882353 0.010959
alert|openshift-monitoring|KubeContainerWaiting 0.882353 0.024658
alert|KubeClientErrors 0.764706 0.008219
foc|monitoring|Progressing|RollOutInProgress|f311 0.529412 0.002740
foc|cloud-credential|Degraded|CredentialsFailing|337e 0.411765 0.002740
foc|cloud-credential|Progressing|Reconciling|b7a8 0.411765 0.002740
[27]
# frequent symptom combinations in depls of this cluster
curr_clust_combinations = get_frequent_symptom_combinations(
    X_df[is_depl_in_clust],
    min_support=0.25,
    drop_singles=True,
)
display(HTML("<h3>Most Affecting Symptom Combinations</h3>"))
display(curr_clust_combinations.head(10))

# frequent symptom patterns in depls of all other clusters
# NOTE: since the total num of depls is high, and it can contain depls of different types, min_support is kept very low (~5%)
other_clusts_combinations = get_frequent_symptom_combinations(
    X_df[~is_depl_in_clust],
    min_support=0.05,
    drop_singles=True,
)

# what symptom patterns have different frequencies in this cluster as compared to other clusters
diff = pd.merge(
    curr_clust_combinations,
    other_clusts_combinations,
    how="left",
    left_index=True,
    right_index=True,
    suffixes=("_this", "_others"),
).fillna(0.05)  # combinations absent from other clusters get a small placeholder support

# display in descending order of difference
display(HTML("<h3>Most <i>UNIQUELY</i> Affecting Symptom Combinations</h3>"))
diff.reindex(
    (diff["percent_affected_this"] - diff["percent_affected_others"])
    .sort_values(ascending=False)
    .index
).head(10)

Most Affecting Symptom Combinations

percent_affected
symptom_combination
(alert|openshift-monitoring|TargetDown|prometheus-adapter, alert|openshift-monitoring|KubePodNotReady) 1.000000
(alert|openshift-monitoring|TargetDown|prometheus-adapter, alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors) 1.000000
(alert|openshift-monitoring|TargetDown|prometheus-adapter, alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors, alert|openshift-monitoring|KubePodNotReady) 1.000000
(alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors, alert|openshift-monitoring|KubePodNotReady) 1.000000
(alert|openshift-monitoring|KubePodNotReady, alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors, foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc) 0.941176
(alert|openshift-monitoring|TargetDown|prometheus-adapter, alert|openshift-monitoring|KubePodNotReady, alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors, foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc) 0.941176
(alert|openshift-monitoring|TargetDown|prometheus-adapter, alert|openshift-monitoring|KubePodNotReady, foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc) 0.941176
(alert|openshift-monitoring|TargetDown|prometheus-adapter, alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors, foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc) 0.941176
(alert|openshift-monitoring|TargetDown|prometheus-adapter, foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc) 0.941176
(alert|openshift-monitoring|KubePodNotReady, foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc) 0.941176

Most UNIQUELY Affecting Symptom Combinations

percent_affected_this percent_affected_others
symptom_combination
(alert|openshift-monitoring|TargetDown|prometheus-adapter, alert|openshift-monitoring|KubePodNotReady) 1.000000 0.05
(alert|openshift-monitoring|TargetDown|prometheus-adapter, alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors) 1.000000 0.05
(alert|openshift-monitoring|TargetDown|prometheus-adapter, alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors, alert|openshift-monitoring|KubePodNotReady) 1.000000 0.05
(alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors, alert|openshift-monitoring|KubePodNotReady) 1.000000 0.05
(alert|openshift-monitoring|TargetDown|prometheus-adapter, alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors, foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc) 0.941176 0.05
(alert|openshift-monitoring|KubePodNotReady, foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc) 0.941176 0.05
(alert|openshift-monitoring|TargetDown|prometheus-adapter, foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc) 0.941176 0.05
(alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors, foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc) 0.941176 0.05
(alert|openshift-monitoring|TargetDown|prometheus-adapter, alert|openshift-monitoring|KubePodNotReady, foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc) 0.941176 0.05
(alert|openshift-monitoring|TargetDown|prometheus-adapter, alert|openshift-monitoring|KubePodNotReady, alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors, foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc) 0.941176 0.05

Interpretation

It seems (to a non-SME eye) that the main issue for the deployments in this cluster is that prometheus or the cluster monitoring operator is not deployed correctly. The symptoms that point towards this are alert|openshift-monitoring|TargetDown|prometheus-adapter, foc|monitoring|Degraded|UpdatingprometheusAdapterFailed|e0dc, alert|openshift-monitoring|ClusterMonitoringOperatorReconciliationErrors, foc|monitoring|Progressing|RollOutInProgress|f311, etc.

Conclusion and Next Steps

The results above do not seem as definitive as those obtained by running this analysis on production data. When run on data collected from actual production deployments, we were able to surface the exact symptoms and combinations that engineers had used to define the existing sdn-issue, kubelet-down, and DefaultSecurityContextConstraints_Mutated diagnoses.

One possible reason why we do not see the same results here is that CI/CD deployments do not have the same workloads as production deployments. Therefore, the amount and variety of symptoms we have is limited. Nonetheless, as seen in the above example, these recommendations can still be quite helpful and save engineers time in determining the underlying problem.

Considering the results here and in the notebook with production data, we can conclude that:

  1. If deployments are assigned to the same group/cluster by the clustering algorithm, then many of them tend to share the same or a related diagnosis.
  2. Determining the most frequent (and most "characteristic") symptoms and symptom combinations using frequent pattern mining can hint towards the underlying diagnosis.

Therefore, ML techniques such as clustering and frequent pattern mining can indeed be used to help identify and define new diagnoses.

As next steps, SMEs should perform the analysis done in the Example Potential Diagnosis section for the rest of the cluster (group) ids, to see what other symptom patterns could be made into diagnosis definitions. To do this, simply change the value of the variable clustid and re-run the cells in that section.