
Telemetry Data for CI Clusters

Every cluster running an OpenShift CI job sends some operational data back to Red Hat via Telemetry. This data gets stored as Prometheus metrics in a Thanos deployment at Red Hat. Examples of the metrics collected here include CPU and memory capacity, operators installed, alerts fired, provider platform, etc. Thus, in addition to the high-level test run data on TestGrid and Prow, we also have detailed time series data available for the CI clusters that ran the tests.

In this notebook, we will show how to access this telemetry data using some open source tools developed by the AIOps team. Specifically, we will show how, given a specific CI job run, to get the telemetry data associated with the cluster that ran it. In addition, we will show how to get the pass/fail status of a given build of a job, from a given point in time. Then we will compare the telemetry metrics of a passing build with those of a failing build within the same job, and see if we can get insights into the build failure.

NOTE: Since this data is currently hosted on a Red Hat internal Thanos, only those users with access to it will be able to run this notebook to get "live" data. To ensure that the wider open source community is also able to use this data for further analysis, we will use this notebook to extract a snippet of this data and save it on our public GitHub repo.

[1]
# import all the required libraries
import os
import warnings
import datetime as dt
from tqdm.notebook import tqdm
from dotenv import load_dotenv, find_dotenv
from urllib3.exceptions import InsecureRequestWarning
import requests

import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

from prometheus_api_client import (
    PrometheusConnect,
    MetricSnapshotDataFrame,
    MetricRangeDataFrame,
)

import sys

sys.path.insert(1, "../TestGrid/metrics")
from ipynb.fs.defs.metric_template import save_to_disk  # noqa: E402

load_dotenv(find_dotenv())
True
[2]
# config for a pretty notebook
sns.set()
load_dotenv(find_dotenv())
warnings.filterwarnings("ignore", category=InsecureRequestWarning)

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

Data Access Setup

In this section, we will configure the prometheus-api-client-python tool to pull data from our Thanos instance. That is, set the value of PROM_URL to the Thanos endpoint, and set the value of PROM_ACCESS_TOKEN to the bearer token for authentication. We will also set the timestamp from which telemetry data is to be pulled.

In order to get access to the token, you can follow either of these approaches:

1. Visit https://datahub.psi.redhat.com/. Click on your profile (top right) and select Copy Login Command from the drop-down menu. This will copy a command that looks something like: oc login https://datahub.psi.redhat.com:443 --token=<YOUR_TOKEN>. The value in YOUR_TOKEN is the required token.
2. From the command line, run oc whoami --show-token. Ensure that the output of oc project points to https://datahub.psi.redhat.com/. This will output the required token.

NOTE: The above methods can only be used if you are on the Red Hat VPN.
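The values of PROM_URL and PROM_ACCESS_TOKEN are read from the environment via load_dotenv(find_dotenv()). As a minimal sketch (the endpoint is a placeholder, not the real URL), the .env file picked up by load_dotenv is assumed to look like this:

PROM_URL=<THANOS_ENDPOINT_URL>
PROM_ACCESS_TOKEN=<YOUR_TOKEN>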

[3]
# prometheus from which metrics are to be fetched
PROM_URL = os.getenv("PROM_URL")
PROM_ACCESS_TOKEN = os.getenv("PROM_ACCESS_TOKEN")
[4]
# prometheus connector object
pc = PrometheusConnect(
    url=PROM_URL,
    disable_ssl=True,
    headers={"Authorization": f"bearer {PROM_ACCESS_TOKEN}"},
)
[5]
# which metrics to fetch
# we will try to get all metrics, but leave out ones that may have potentially sensitive data
metrics_to_fetch = [
    m
    for m in pc.all_metrics()
    if "subscription" not in m and "internal" not in m and "url" not in m
]
[6]
# these fields are either irrelevant or contain something that could potentially be sensitive
# either way, these likely won't be useful for analysis, so exclude them when reading data
drop_cols = [
    "prometheus",
    "tenant_id",
    "endpoint",
    "instance",
    "receive",
    "url",
]

Get All Data for Job Build

In this section, we will get all the prometheus metrics corresponding to a given job name, build id and fixed date. If you want to extract the information for another job name and build id, please ensure that Thanos actually contains the data for this job/build at the corresponding timestamp at which you're evaluating the query. The job name and build id can be obtained either directly from the testgrid UI, or from the query and changelists fields respectively in the testgrid json as shown in the testgrid metadata EDA notebook.

One of the metrics stored in Thanos is cluster_installer. This metric describes what entity triggered the install of each cluster. For the clusters that run OpenShift CI jobs, the invoker label value in this metric is set to openshift-internal-ci/{job_name}/{build_id}.

Therefore, we can get all the data for a given job build by first finding the ID of the cluster that ran it (using cluster_installer), and then querying Prometheus for metrics where the _id label value equals this cluster ID. These steps are demonstrated in the example below. We will choose one job and two of its build ids, and later compare their telemetry metric values.
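For reference, custom_query from prometheus-api-client returns the raw instant-query result: a list with one entry per matching time series, each holding a dict of labels under "metric" and a [timestamp, value] pair under "value". A minimal, illustrative sketch of the shape we rely on below (label values are placeholders, not real data):

# illustrative shape of the cluster_installer query result;
# only the "_id" label is extracted in the cells below
[
    {
        "metric": {
            "__name__": "cluster_installer",
            "_id": "<CLUSTER_ID>",
            "invoker": "openshift-internal-ci/<job_name>/<build_id>",
        },
        "value": [1642542831.0, "1"],
    }
]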

[7]
# timestamp for which prometheus queries will be evaluated
query_eval_time = dt.datetime(
    year=2022, month=1, day=18, hour=21, minute=53, second=51, tzinfo=dt.timezone.utc
)
query_eval_ts = query_eval_time.timestamp()
query_eval_ts
1642542831.0
[8]
# example job and build
job_name = "periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade"

build_id_1 = "1483543721860403200"
build_id_2 = "1483543720656637952"
[9]
# get installer info for the job/build
job_build_cluster_installer1 = pc.custom_query(
    query=f'cluster_installer{{invoker="openshift-internal-ci/{job_name}/{build_id_1}"}}',
    params={"time": query_eval_ts},
)
job_build_cluster_installer2 = pc.custom_query(
    query=f'cluster_installer{{invoker="openshift-internal-ci/{job_name}/{build_id_2}"}}',
    params={"time": query_eval_ts},
)

# extract cluster id out of the installer info metric
cluster_id1 = job_build_cluster_installer1[0]["metric"]["_id"]
cluster_id2 = job_build_cluster_installer2[0]["metric"]["_id"]

Overall Status

We are able to get the status of past builds from the historical TestGrid data that we collect, which is how we calculate build pass/fail metrics over time. However, the build_id is not readily available from the TestGrid data, so it makes more sense to get the status from the Prow logs. The overall status (pass/fail) of a past job and build id can be found in the finished.json file in the build log data (example link).
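For reference, finished.json is a small JSON document; the only field used below is result, whose value is SUCCESS or FAILURE. An illustrative payload, assuming the usual Prow fields (anything beyond result is an assumption here):

{
    "timestamp": 1642542831,
    "passed": true,
    "result": "SUCCESS"
}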

Here we check the status of the two builds using their Prow logs.

[10]
prow_log1 = requests.get(
    f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com"
    f"/gcs/origin-ci-test/logs/{job_name}/{build_id_1}/finished.json"
)

prow_log2 = requests.get(
    f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com"
    f"/gcs/origin-ci-test/logs/{job_name}/{build_id_2}/finished.json"
)

status1 = prow_log1.json()["result"]
status2 = prow_log2.json()["result"]
[11]
print(
    f"The status of the first build is {status1}"
    f" and the status of second build is {status2}."
)
The status of the first build is SUCCESS and the status of second build is FAILURE.

In the next steps, we will fetch and compare the metrics for the two builds.

Get One Metric

Before we fetch all the metrics, let's fetch just one metric to familiarize ourselves with the data format and understand how to interpret it. In the cell below, we will look at an example metric, cluster:capacity_cpu_cores:sum.

[12]
# fetch the metric and format it into a df
metric_df1 = MetricSnapshotDataFrame(
    pc.custom_query(
        query=f'cluster:capacity_cpu_cores:sum{{_id="{cluster_id1}"}}',
        params={"time": query_eval_ts},
    )
)

metric_df2 = MetricSnapshotDataFrame(
    pc.custom_query(
        query=f'cluster:capacity_cpu_cores:sum{{_id="{cluster_id2}"}}',
        params={"time": query_eval_ts},
    )
)

metric_df1.drop(columns=drop_cols, errors="ignore", inplace=True)
metric_df2.drop(columns=drop_cols, errors="ignore", inplace=True)

Metric for the SUCCESSFUL build

[13]
metric_df1
__name__ _id label_beta_kubernetes_io_instance_type label_kubernetes_io_arch label_node_openshift_io_os_id timestamp value label_node_role_kubernetes_io
0 cluster:capacity_cpu_cores:sum 59a54b9c-d0fd-4b1a-b119-c24fb08fa03c m5.xlarge amd64 rhcos 1642542831 12 NaN
1 cluster:capacity_cpu_cores:sum 59a54b9c-d0fd-4b1a-b119-c24fb08fa03c m6i.xlarge amd64 rhcos 1642542831 12 master

Metric for the FAILED build

[14]
metric_df2
__name__ _id label_beta_kubernetes_io_instance_type label_kubernetes_io_arch label_node_openshift_io_os_id timestamp value label_node_role_kubernetes_io
0 cluster:capacity_cpu_cores:sum 026e1f8a-82af-4b42-8ed3-0f6af0d7ebea m5.xlarge amd64 rhcos 1642542831 12 NaN
1 cluster:capacity_cpu_cores:sum 026e1f8a-82af-4b42-8ed3-0f6af0d7ebea m6i.xlarge amd64 rhcos 1642542831 12 master

HOW TO READ THIS DATAFRAME

In the above dataframe, each column represents a "label" of the Prometheus metric, and each row represents a different "label configuration". In this example, the first row has label_node_role_kubernetes_io = NaN and value = 12, and the second row has label_node_role_kubernetes_io = master and value = 12. This means that in this cluster, the master nodes had a total of 12 CPU cores, and the nodes without a role label (the workers) also had a total of 12 CPU cores.

To learn more about labels, label configurations, and the prometheus data model in general, please check out their official documentation here.

Observing the above metric values, we see that the CPU capacity is 12 cores for both the SUCCESSFUL and the FAILED build.
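As a quick sanity check, the same per-role values can be aggregated directly from the two dataframes above. A minimal sketch, assuming the metric_df1 and metric_df2 dataframes from the previous cells:

# sum the CPU capacity per node role for both builds
# (rows with a missing role label correspond to the worker nodes)
for label, df in [("SUCCESS", metric_df1), ("FAILURE", metric_df2)]:
    per_role = (
        df.assign(value=df["value"].astype(float))
        .groupby("label_node_role_kubernetes_io", dropna=False)["value"]
        .sum()
    )
    print(label, per_role.to_dict())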

Get All Metrics

Now that we understand the data structure of the metrics, let's fetch all the metrics and concatenate them into one single dataframe.

Metrics for testgrid SUCCESSFUL build

[15]
# let's combine all the metrics into one dataframe
# for the above mentioned job name and build name.

all_metrics_df = pd.DataFrame()
for metric in metrics_to_fetch:
    metric_df = MetricSnapshotDataFrame(
        pc.custom_query(
            query=f'{metric}{{_id="{cluster_id1}"}}',
            params={"time": query_eval_ts},
        )
    )
    if len(metric_df) > 0:
        metric_df.drop(columns=drop_cols, errors="ignore", inplace=True)
        # print(f"Metric = {metric}")
        # display(metric_df.head())
        all_metrics_df = pd.concat(
            [
                all_metrics_df,
                metric_df,
            ],
            axis=0,
            join="outer",
            ignore_index=True,
        )
all_metrics_df.head(5)
__name__ _id alertname alertstate namespace severity timestamp value container job mode pod service apiserver label_beta_kubernetes_io_instance_type label_kubernetes_io_arch label_node_openshift_io_os_id label_node_role_kubernetes_io plugin_name volume_mode provisioner networks resource type region invoker version condition name reason from_version image code metrics_path exported_namespace install_type network_type host_type provider client status_code quantile exported_service label_node_hyperthread_enabled label_node_role_kubernetes_io_master
0 alerts 59a54b9c-d0fd-4b1a-b119-c24fb08fa03c AlertmanagerReceiversNotConfigured firing openshift-monitoring warning 1642542831 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 alerts 59a54b9c-d0fd-4b1a-b119-c24fb08fa03c Watchdog firing openshift-monitoring none 1642542831 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 cco_credentials_mode 59a54b9c-d0fd-4b1a-b119-c24fb08fa03c NaN NaN openshift-cloud-credential-operator NaN 1642542831 1 kube-rbac-proxy cco-metrics mint cloud-credential-operator-784f994fff-lsxz8 cco-metrics NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 cluster:alertmanager_integrations:max 59a54b9c-d0fd-4b1a-b119-c24fb08fa03c NaN NaN NaN NaN 1642542831 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 cluster:apiserver_current_inflight_requests:su... 59a54b9c-d0fd-4b1a-b119-c24fb08fa03c NaN NaN NaN NaN 1642542831 10 NaN NaN NaN NaN NaN kube-apiserver NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[16]
# drop duplicate rows and keep a copy for the SUCCESS build
all_metricsdf1 = all_metrics_df.drop_duplicates()
all_metricsdf1.head()
__name__ _id alertname alertstate namespace severity timestamp value container job mode pod service apiserver label_beta_kubernetes_io_instance_type label_kubernetes_io_arch label_node_openshift_io_os_id label_node_role_kubernetes_io plugin_name volume_mode provisioner networks resource type region invoker version condition name reason from_version image code metrics_path exported_namespace install_type network_type host_type provider client status_code quantile exported_service label_node_hyperthread_enabled label_node_role_kubernetes_io_master
0 alerts 59a54b9c-d0fd-4b1a-b119-c24fb08fa03c AlertmanagerReceiversNotConfigured firing openshift-monitoring warning 1642542831 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 alerts 59a54b9c-d0fd-4b1a-b119-c24fb08fa03c Watchdog firing openshift-monitoring none 1642542831 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 cco_credentials_mode 59a54b9c-d0fd-4b1a-b119-c24fb08fa03c NaN NaN openshift-cloud-credential-operator NaN 1642542831 1 kube-rbac-proxy cco-metrics mint cloud-credential-operator-784f994fff-lsxz8 cco-metrics NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 cluster:alertmanager_integrations:max 59a54b9c-d0fd-4b1a-b119-c24fb08fa03c NaN NaN NaN NaN 1642542831 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 cluster:apiserver_current_inflight_requests:su... 59a54b9c-d0fd-4b1a-b119-c24fb08fa03c NaN NaN NaN NaN 1642542831 10 NaN NaN NaN NaN NaN kube-apiserver NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Metrics for testgrid FAILED build

[17]
# let's combine all the metrics into one dataframe
# for the above mentioned job name and build name.

all_metrics_df = pd.DataFrame()
for metric in metrics_to_fetch:
    metric_df = MetricSnapshotDataFrame(
        pc.custom_query(
            query=f'{metric}{{_id="{cluster_id2}"}}',
            params={"time": query_eval_ts},
        )
    )
    if len(metric_df) > 0:
        metric_df.drop(columns=drop_cols, errors="ignore", inplace=True)
        all_metrics_df = pd.concat(
            [
                all_metrics_df,
                metric_df,
            ],
            axis=0,
            join="outer",
            ignore_index=True,
        )
all_metrics_df.head()
__name__ _id alertname alertstate namespace severity timestamp value container job mode pod service apiserver label_beta_kubernetes_io_instance_type label_kubernetes_io_arch label_node_openshift_io_os_id label_node_role_kubernetes_io plugin_name volume_mode provisioner networks resource type region invoker version condition name reason from_version image code metrics_path exported_namespace install_type network_type host_type provider client status_code quantile exported_service label_node_hyperthread_enabled label_node_role_kubernetes_io_master
0 alerts 026e1f8a-82af-4b42-8ed3-0f6af0d7ebea AlertmanagerReceiversNotConfigured firing openshift-monitoring warning 1642542831 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 alerts 026e1f8a-82af-4b42-8ed3-0f6af0d7ebea Watchdog firing openshift-monitoring none 1642542831 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 cco_credentials_mode 026e1f8a-82af-4b42-8ed3-0f6af0d7ebea NaN NaN openshift-cloud-credential-operator NaN 1642542831 1 kube-rbac-proxy cco-metrics mint cloud-credential-operator-784f994fff-559g2 cco-metrics NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 cluster:alertmanager_integrations:max 026e1f8a-82af-4b42-8ed3-0f6af0d7ebea NaN NaN NaN NaN 1642542831 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 cluster:apiserver_current_inflight_requests:su... 026e1f8a-82af-4b42-8ed3-0f6af0d7ebea NaN NaN NaN NaN 1642542831 24 NaN NaN NaN NaN NaN kube-apiserver NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[18]
all_metricsdf2 = all_metrics_df.drop_duplicates()
all_metricsdf2.head()
__name__ _id alertname alertstate namespace severity timestamp value container job mode pod service apiserver label_beta_kubernetes_io_instance_type label_kubernetes_io_arch label_node_openshift_io_os_id label_node_role_kubernetes_io plugin_name volume_mode provisioner networks resource type region invoker version condition name reason from_version image code metrics_path exported_namespace install_type network_type host_type provider client status_code quantile exported_service label_node_hyperthread_enabled label_node_role_kubernetes_io_master
0 alerts 026e1f8a-82af-4b42-8ed3-0f6af0d7ebea AlertmanagerReceiversNotConfigured firing openshift-monitoring warning 1642542831 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 alerts 026e1f8a-82af-4b42-8ed3-0f6af0d7ebea Watchdog firing openshift-monitoring none 1642542831 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 cco_credentials_mode 026e1f8a-82af-4b42-8ed3-0f6af0d7ebea NaN NaN openshift-cloud-credential-operator NaN 1642542831 1 kube-rbac-proxy cco-metrics mint cloud-credential-operator-784f994fff-559g2 cco-metrics NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 cluster:alertmanager_integrations:max 026e1f8a-82af-4b42-8ed3-0f6af0d7ebea NaN NaN NaN NaN 1642542831 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 cluster:apiserver_current_inflight_requests:su... 026e1f8a-82af-4b42-8ed3-0f6af0d7ebea NaN NaN NaN NaN 1642542831 24 NaN NaN NaN NaN NaN kube-apiserver NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Comparing some metrics for the SUCCESSFUL and FAILED builds

[19]
# For SUCCESS build:
print(all_metricsdf1.__name__.nunique())
# For FAILED build:
print(all_metricsdf2.__name__.nunique())
print("The number of metrices extracted from the two builds are different.")
69 70 The number of metrices extracted from the two builds are different.

In the next steps, we merge the two datasets from the two builds and sort the merged dataframe by the percentage difference in the metric values. Here, df1 is the dataframe for the successful build and df2 the dataframe for the failed build.

Percentage Difference:

$$ D = \frac{|d_{1} - d_{2}|}{\frac{(d_{1} + d_{2})}{2}} \times 100 $$
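As a quick worked example, using the cluster:apiserver_current_inflight_requests:sum values from the merged table further below (13 for the successful build, 29 for the failed one):

# percentage difference for d1 = 13 (success) and d2 = 29 (failure)
d1, d2 = 13.0, 29.0
diff = abs(d1 - d2) / ((d1 + d2) / 2) * 100
print(diff)  # ~76.19, matching the "diff" column below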
[20]
df1 = all_metricsdf1[["__name__", "value"]].drop_duplicates()
df1["value"] = df1["value"].astype(float)
df1 = df1.groupby(["__name__"])["value"].sum().to_frame()
df2 = all_metricsdf2[["__name__", "value"]].drop_duplicates()
df2["value"] = df2["value"].astype(float)
df2 = df2.groupby(["__name__"])["value"].sum().to_frame()

df1.rename(columns={"value": "Success_build_value"}, inplace=True)
df2.rename(columns={"value": "Failure_build_value"}, inplace=True)
df1 = df1.reset_index()
df2 = df2.reset_index()
[21]
df_merge = pd.merge(df1, df2, on="__name__", how="outer")
df_merge["diff"] = abs(
    ((df_merge["Success_build_value"] - df_merge["Failure_build_value"]) * 100)
    / ((df_merge["Success_build_value"] + df_merge["Failure_build_value"]) / 2)
)
df_merge = df_merge[
    ["__name__", "Success_build_value", "Failure_build_value", "diff"]
].sort_values(by="diff", ascending=False)
[22]
df_merge
__name__ Success_build_value Failure_build_value diff
3 cluster:apiserver_current_inflight_requests:su... 1.300000e+01 2.900000e+01 76.190476
51 instance:etcd_disk_backend_commit_duration_sec... 2.132312e-02 1.381915e-02 42.706215
56 instance:etcd_object_counts:sum 2.470200e+04 3.762300e+04 41.463297
55 instance:etcd_network_peer_round_trip_time_sec... 3.778560e-02 2.521600e-02 39.902479
22 cluster:usage:openshift:ingress_request_total:... 3.400000e+00 4.466667e+00 27.118644
63 openshift:cpu_usage_cores:sum 2.572576e+00 3.335502e+00 25.826554
6 cluster:cpu_usage_cores:sum 2.574476e+00 3.337539e+00 25.813958
60 namespace_job:scrape_series_added:topk3_sum1h 2.363220e+05 3.037350e+05 24.965143
18 cluster:usage:ingress_frontend_connections:sum 1.900000e+01 2.300000e+01 19.047619
25 cluster:usage:workload:capacity_physical_cpu_c... 9.852000e+03 1.149000e+04 15.350014
52 instance:etcd_disk_wal_fsync_duration_seconds:... 1.431914e-02 1.244548e-02 14.001049
17 cluster:usage:ingress_frontend_bytes_out:rate5... 4.423026e+04 3.922549e+04 11.993840
64 openshift:memory_usage_bytes:sum 2.804829e+10 2.510586e+10 11.071283
10 cluster:memory_usage_bytes:sum 2.815404e+10 2.521127e+10 11.028780
9 cluster:kubelet_volume_stats_used_bytes:provis... 2.393539e+08 2.621563e+08 9.093508
67 workload:cpu_usage_cores:sum 1.900651e-03 2.036780e-03 6.914581
57 monitoring:container_memory_working_set_bytes:sum 5.172630e+09 5.499646e+09 6.128336
59 namespace_job:scrape_samples_post_metric_relab... 6.769200e+04 7.172500e+04 5.785521
15 cluster:usage:containers:sum 6.660000e+02 7.050000e+02 5.689278
16 cluster:usage:ingress_frontend_bytes_in:rate5m... 4.236015e+03 4.011374e+03 5.447560
41 code:apiserver_request_total:rate:sum 7.531959e+01 7.176856e+01 4.828425
54 instance:etcd_mvcc_db_total_size_in_use_in_byt... 2.299535e+08 2.410947e+08 4.730353
53 instance:etcd_mvcc_db_total_size_in_bytes:sum 2.746409e+08 2.620826e+08 4.679630
65 openshift:prometheus_tsdb_head_samples_appende... 1.770057e+04 1.695999e+04 4.273317
66 openshift:prometheus_tsdb_head_series:sum 7.357900e+05 7.655760e+05 3.967853
24 cluster:usage:resources:sum 1.218100e+04 1.261300e+04 3.484714
23 cluster:usage:openshift:kube_running_pod_ready... 9.950495e-01 1.000000e+00 0.496278
68 workload:memory_usage_bytes:sum 1.057546e+08 1.054065e+08 0.329758
14 cluster:telemetry_selected_series:count 5.050000e+02 5.040000e+02 0.198216
5 cluster:capacity_memory_bytes:sum 9.886904e+10 9.904518e+10 0.177992
39 cluster_version 6.570163e+09 6.570162e+09 0.000009
61 node_role_os_version_machine:cpu_capacity_core... 6.000000e+00 6.000000e+00 0.000000
45 id_network_type 7.000000e+00 7.000000e+00 0.000000
62 node_role_os_version_machine:cpu_capacity_sock... 3.000000e+00 3.000000e+00 0.000000
50 insightsclient_request_send_total 1.000000e+00 1.000000e+00 0.000000
49 id_version:cluster_available 1.000000e+00 1.000000e+00 0.000000
0 alerts 1.000000e+00 1.000000e+00 0.000000
43 csv_succeeded 1.000000e+00 1.000000e+00 0.000000
27 cluster:usage:workload:capacity_physical_cpu_c... 6.000000e+00 6.000000e+00 0.000000
4 cluster:capacity_cpu_cores:sum 1.200000e+01 1.200000e+01 0.000000
7 cluster:kube_persistentvolume_plugin_type_coun... 2.000000e+00 2.000000e+00 0.000000
8 cluster:kube_persistentvolumeclaim_resource_re... 2.147484e+10 2.147484e+10 0.000000
13 cluster:node_instance_type_count:sum 3.000000e+00 3.000000e+00 0.000000
19 cluster:usage:kube_node_ready:avg5m 1.000000e+00 1.000000e+00 0.000000
20 cluster:usage:kube_schedulable_node_ready_reac... 1.000000e+00 1.000000e+00 0.000000
26 cluster:usage:workload:capacity_physical_cpu_c... 6.000000e+00 6.000000e+00 0.000000
30 cluster:usage:workload:kube_running_pod_ready:avg 1.000000e+00 1.000000e+00 0.000000
42 count:up1 1.200000e+01 1.200000e+01 0.000000
31 cluster:virt_platform_nodes:sum 6.000000e+00 6.000000e+00 0.000000
32 cluster_feature_set 1.000000e+00 1.000000e+00 0.000000
33 cluster_infrastructure_provider 1.000000e+00 1.000000e+00 0.000000
1 cco_credentials_mode 1.000000e+00 1.000000e+00 0.000000
37 cluster_operator_conditions 1.000000e+00 1.000000e+00 0.000000
38 cluster_operator_up 1.000000e+00 1.000000e+00 0.000000
40 cluster_version_payload 7.690000e+02 7.690000e+02 0.000000
34 cluster_installer 1.000000e+00 1.000000e+00 0.000000
2 cluster:alertmanager_integrations:max 0.000000e+00 0.000000e+00 NaN
11 cluster:network_attachment_definition_enabled_... 0.000000e+00 0.000000e+00 NaN
12 cluster:network_attachment_definition_instance... 0.000000e+00 0.000000e+00 NaN
21 cluster:usage:openshift:ingress_request_error:... 0.000000e+00 0.000000e+00 NaN
28 cluster:usage:workload:ingress_request_error:f... 0.000000e+00 0.000000e+00 NaN
29 cluster:usage:workload:ingress_request_total:i... 0.000000e+00 0.000000e+00 NaN
35 cluster_legacy_scheduler_policy 0.000000e+00 0.000000e+00 NaN
36 cluster_master_schedulable 0.000000e+00 0.000000e+00 NaN
44 id_install_type 0.000000e+00 0.000000e+00 NaN
46 id_primary_host_type 0.000000e+00 0.000000e+00 NaN
47 id_provider 0.000000e+00 0.000000e+00 NaN
48 id_version 0.000000e+00 0.000000e+00 NaN
58 monitoring:haproxy_server_http_responses_total... 0.000000e+00 0.000000e+00 NaN
69 count:up0 NaN 1.000000e+00 NaN

Next, we plot the values of the metrics listed in metric_list below. Using the plots, we can compare the values for the successful and failed builds.

[23]
metric_list = df_merge.__name__.iloc[:5]
metric_list
3     cluster:apiserver_current_inflight_requests:su...
51    instance:etcd_disk_backend_commit_duration_sec...
56                      instance:etcd_object_counts:sum
55    instance:etcd_network_peer_round_trip_time_sec...
22    cluster:usage:openshift:ingress_request_total:...
Name: __name__, dtype: object
[24]
filter_series1 = df_merge.__name__.isin(metric_list)
df_mergef = df_merge[filter_series1]

plt.figure(figsize=(20, 15))

plt.subplot(2, 1, 1)
plt.scatter(
    df_mergef["__name__"],
    df_mergef["Success_build_value"],
    label="success build",
    marker="s",
    s=100,
)
plt.scatter(
    df_mergef["__name__"],
    df_mergef["Failure_build_value"],
    label="failed build",
    marker="o",
    s=100,
)
plt.title("Metrics comparison for success and failed build")
plt.ylabel("metric value", fontsize=18)
plt.xticks([])
plt.yticks(fontsize=14)
plt.legend()

plt.subplot(2, 1, 2)
plt.scatter(
    df_mergef["__name__"],
    df_mergef["diff"],
    label="Percentage Difference",
    marker="o",
    s=100,
)
plt.title("Difference in metric values")
plt.xlabel("metrics name", fontsize=18)
plt.ylabel("Percentage difference", fontsize=18)
plt.xticks(fontsize=14, rotation=45)
plt.yticks(fontsize=14)
plt.legend()
<matplotlib.legend.Legend at 0x7fef1ca51af0>
[25]
df_mergef
__name__ Success_build_value Failure_build_value diff
3 cluster:apiserver_current_inflight_requests:su... 13.000000 29.000000 76.190476
51 instance:etcd_disk_backend_commit_duration_sec... 0.021323 0.013819 42.706215
56 instance:etcd_object_counts:sum 24702.000000 37623.000000 41.463297
55 instance:etcd_network_peer_round_trip_time_sec... 0.037786 0.025216 39.902479
22 cluster:usage:openshift:ingress_request_total:... 3.400000 4.466667 27.118644

We plotted the metric values from the above table along with their respective percentage differences. Even though some points for the successful and failed builds appear to overlap in the first plot, the percentage difference plot makes it clear that the two values are not the same.

Get Data for Multiple Builds for a Given Job

In this section, we will fetch all the telemetry metrics, from all timestamps, for the 10 most recent builds of a given job. This data can help us understand how the behavior of the available metrics changes over time and across builds.

[26]
# fetch data from this number of builds for this job
NBUILDS = 10

# number of previous days of data to search to get the last n builds data for this job
NDAYS = 2

# max runtime of a build
# NOTE: this is an (over)estimate derived from SME conversations, as well as run durations seen on testgrid
MAX_DURATION_HRS = 12
[27]
# get invoker details
prev_ndays_invokers = MetricRangeDataFrame(
    pc.custom_query_range(
        query=f'max by (_id, invoker) (cluster_installer{{invoker=~"^openshift-internal-ci/{job_name}.*"}})',
        end_time=query_eval_time,
        start_time=query_eval_time - dt.timedelta(days=NDAYS),
        step="5m",
    )
).sort_index()

# split invoker name into prefix, job name, build id.
prev_ndays_invokers[["prefix", "job_name", "build_id"]] = prev_ndays_invokers[
    "invoker"
].str.split("/", expand=True)

# drop now redundant columns.
prev_ndays_invokers.drop(columns=["invoker", "prefix", "value"], inplace=True)

# drop irrelevant columns.
prev_ndays_invokers.drop(columns=drop_cols, errors="ignore", inplace=True)

prev_ndays_invokers.head()
_id job_name build_id
timestamp
1642388700 e249fcab-e322-462c-bfbd-6f9cf96a598b periodic-ci-openshift-release-master-nightly-4... 1482905366760001536
1642388700 740192b4-71f1-490d-90d9-0d494c7d2e17 periodic-ci-openshift-release-master-nightly-4... 1482905366944550912
1642388700 029b1b45-1ecf-4e9b-b36e-59bc5af9a329 periodic-ci-openshift-release-master-nightly-4... 1482905366688698368
1642389000 474c0eef-6bef-47ed-b400-1b173f5f4633 periodic-ci-openshift-release-master-nightly-4... 1482905366797750272
1642389000 740192b4-71f1-490d-90d9-0d494c7d2e17 periodic-ci-openshift-release-master-nightly-4... 1482905366944550912
[28]
# for each build, get cluster id and then the corresponding metrics from all timestamps
all_metrics_df = pd.DataFrame()

for build_id in tqdm(prev_ndays_invokers["build_id"].unique()[:NBUILDS]):

    job_build_cluster_installer = pc.custom_query_range(
        query=f'cluster_installer{{invoker="openshift-internal-ci/{job_name}/{build_id}"}}',
        end_time=query_eval_time,
        start_time=query_eval_time
        - dt.timedelta(days=NDAYS)
        - dt.timedelta(hours=MAX_DURATION_HRS),
        step="5m",
    )

    # extract cluster id out of the installer info metric
    cluster_id = job_build_cluster_installer[0]["metric"]["_id"]

    # get all telemetry time series
    for metric in metrics_to_fetch:

        # fetch the metric
        metric_result = pc.custom_query_range(
            query=f'{metric}{{_id="{cluster_id}"}}',
            end_time=query_eval_time,
            start_time=query_eval_time
            - dt.timedelta(days=NDAYS)
            - dt.timedelta(hours=MAX_DURATION_HRS),
            step="5m",
        )

        if len(metric_result) > 0:
            metric_df = MetricRangeDataFrame(metric_result).reset_index(drop=False)

            # drop irrelevant cols, if any
            metric_df.drop(columns=drop_cols, errors="ignore", inplace=True)

            # combine all the metrics data.
            all_metrics_df = pd.concat(
                [
                    all_metrics_df,
                    metric_df,
                ],
                axis=0,
                join="outer",
                ignore_index=True,
            )

all_metrics_df["value"] = all_metrics_df["value"].astype(float)
[29]
# visualize time series behavior across builds
for metric in all_metrics_df["__name__"].unique():
    plt.figure(figsize=(15, 5))

    metric_df = all_metrics_df[all_metrics_df["__name__"] == metric][
        ["_id", "timestamp", "value"]
    ]
    metric_df.set_index("timestamp").groupby("_id").value.plot(legend=True)

    plt.xlabel("timestamp")
    plt.ylabel("value")
    plt.legend(loc="best")
    plt.title(metric)
    plt.show()
[30]
# save the metrics as a static dataset to use in future
save_to_disk(
    all_metrics_df,
    "../../../data/raw/",
    f"telemetry-{query_eval_time.year}-{query_eval_time.month}-{query_eval_time.day}.parquet",
)
True

Conclusion

In this notebook, we have:

  • Checked the pass/fail status of builds of a given job.
  • Collected all telemetry data corresponding to a given job and build.
  • Compared the telemetry data for a SUCCESSFUL and a FAILED build of the same job.
  • Understood how to interpret Prometheus data using an example metric.
  • Collected all telemetry data from all timestamps for the 10 most recent builds of a given job.
  • Visualized the general time series behavior of metrics across builds.
  • Saved the above data for further analysis.