Quantify Flakes

One of the key performance indicators that we would like to make more visible and track over time is the overall number and percentage of flakes that occur. An individual test run is flagged as a "flake" if it is run multiple times with a mix of passes and failures without any changes to the code being tested. Although flakes occur in individual test runs, there are a number of aggregate views that developers may want to look at to assess the overall health of their project or testing platform. For example:

  • percent flakes on platform each day
  • percent flakes by tab each week
  • percent flakes by grid each month
  • percent flakes by test overall (this can also be seen as a severity level = overall flake rate of test)

To provide maximum flexibility for the end user of this work, instead of creating a separate dataframe to answer each of these specific questions, we define a long and narrow data structure (a list of tuples saved as a csv for now) that contains only 5 columns ("timestamp", "tab", "grid", "test", "flake"). This allows Superset (or pandas) to perform the final filtering and/or aggregation of interest to an end user. There may appear to be a lot of repetition within the final dataset, but each row should be unique, and this layout should be the simplest for an end user to work with.
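As a minimal sketch of how the aggregate views listed above could be derived from this layout, the example below builds a tiny dataframe with the same 5 columns and computes three of them with pandas. The rows, tab names, grid names, and test names are invented for illustration and do not come from the TestGrid data.

```python
import pandas as pd

# Toy rows in the long, narrow layout; every value here is invented.
rows = [
    ("2021-03-14 00:01:00", "tab-a", "grid-1", "test-x", False),
    ("2021-03-14 06:00:00", "tab-a", "grid-1", "test-y", True),
    ("2021-03-15 00:01:00", "tab-a", "grid-2", "test-x", True),
    ("2021-03-15 12:00:00", "tab-b", "grid-3", "test-z", False),
    ("2021-03-15 18:00:00", "tab-b", "grid-3", "test-x", True),
    ("2021-03-15 23:00:00", "tab-b", "grid-3", "test-y", True),
]
toy = pd.DataFrame(rows, columns=["timestamp", "tab", "grid", "test", "flake"])
toy["timestamp"] = pd.to_datetime(toy["timestamp"])

# Percent flakes on the platform each day.
daily_pct = toy.groupby(toy["timestamp"].dt.date)["flake"].mean() * 100

# Percent flakes by tab each week (weeks end on Sunday with freq="W").
weekly_by_tab = (
    toy.groupby(["tab", pd.Grouper(key="timestamp", freq="W")])["flake"].mean() * 100
)

# Percent flakes by test overall, i.e. the flake rate of each test.
pct_by_test = toy.groupby("test")["flake"].mean() * 100

print(daily_pct)
print(weekly_by_tab)
print(pct_by_test)
```

Because `flake` is a boolean column, its mean within any group is the fraction of rows flagged as flakes, so every view is the same `groupby(...).mean()` with a different grouping key.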

[1]
import json
import gzip
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime


from ipynb.fs.defs.metric_template import testgrid_labelwise_encoding
from ipynb.fs.defs.metric_template import CephCommunication
from ipynb.fs.defs.metric_template import save_to_disk, read_from_disk
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())
True
[2]
## Specify variables
METRIC_NAME = "number_of_flakes"
# Specify the path for input grid data
INPUT_DATA_PATH = "../../../../data/raw/testgrid_183.json.gz"

# Specify the path for output metric data
OUTPUT_DATA_PATH = f"../../../../data/processed/metrics/{METRIC_NAME}"

# Specify whether or not we are running this as a notebook or part of an automation pipeline.
AUTOMATION = os.getenv("IN_AUTOMATION")

## CEPH Bucket variables
## Create a .env file on your local machine with the correct configs
s3_endpoint_url = os.getenv("S3_ENDPOINT")
s3_access_key = os.getenv("S3_ACCESS_KEY")
s3_secret_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")
s3_input_data_path = "raw_data"
metric_path = f"ai4ci/testgrid/metrics/{METRIC_NAME}"
[3]
## Import data
timestamp = datetime.datetime.today()

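# Compare against "True", as the save step does, so that a value such as
# IN_AUTOMATION=False is not treated as truthy.
if AUTOMATION == "True":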
    filename = f"testgrid_{timestamp.day}{timestamp.month}.json"
    cc = CephCommunication(s3_endpoint_url, s3_access_key, s3_secret_key, s3_bucket)
    s3_object = cc.s3_resource.Object(s3_bucket, f"{s3_input_data_path}/{filename}")
    file_content = s3_object.get()["Body"].read().decode("utf-8")
    testgrid_data = json.loads(file_content)

else:
    with gzip.open(INPUT_DATA_PATH, "rb") as read_file:
        testgrid_data = json.load(read_file)

Calculation

  • In this section, we calculate the metric values from the data.
[4]
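# The second argument is the status label to flag in the "flake" column;
# 13 is assumed here to be TestGrid's code for a flaky result.
unrolled_list = testgrid_labelwise_encoding(testgrid_data, 13)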
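The helper `testgrid_labelwise_encoding` is imported from `metric_template` and its implementation is not shown in this notebook. The sketch below is only a guess at the kind of unrolling it performs, under the assumption that each test stores its statuses as run-length encoded `{"count": n, "value": v}` entries aligned with a list of millisecond timestamps. The function names, the example record, and the meaning of status values 1 (pass) and 12 (fail) are assumptions, and the sketch omits the `test_duration` column that the real helper returns.

```python
import datetime

FLAKY = 13  # assumed meaning of the label passed to the helper above


def decode_run_lengths(statuses):
    """Expand [{"count": n, "value": v}, ...] into one status value per run."""
    decoded = []
    for entry in statuses:
        decoded.extend([entry["value"]] * entry["count"])
    return decoded


def unroll_test(tab, grid, test, timestamps_ms, label=FLAKY):
    """Return one (timestamp, tab, grid, test, flake) tuple per run of a test."""
    values = decode_run_lengths(test["statuses"])
    rows = []
    for ts_ms, value in zip(timestamps_ms, values):
        ts = datetime.datetime.fromtimestamp(ts_ms / 1000, tz=datetime.timezone.utc)
        rows.append((ts, tab, grid, test["name"], value == label))
    return rows


# Hypothetical test record: two passes (1), one flake (13), one failure (12).
example_test = {
    "name": "Overall",
    "statuses": [
        {"count": 2, "value": 1},
        {"count": 1, "value": 13},
        {"count": 1, "value": 12},
    ],
}
example_rows = unroll_test(
    "tab-a",
    "grid-1",
    example_test,
    timestamps_ms=[1615852820000, 1615766466000, 1615668692000, 1615621880000],
)
for row in example_rows:
    print(row)
```

Only the run whose status equals the chosen label is marked `True`; an ordinary failure is not counted as a flake.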
[5]
# Convert to dataframe
df = pd.DataFrame(
    unrolled_list,
    columns=["timestamp", "tab", "grid", "test", "test_duration", "flake"],
)
df = df.drop(columns="test_duration")
[6]
df.head()
   timestamp            tab                          grid                                               test     flake
0  2021-03-15 23:40:20  "redhat-assisted-installer"  periodic-ci-openshift-release-master-nightly-4...  Overall  False
1  2021-03-15 00:01:06  "redhat-assisted-installer"  periodic-ci-openshift-release-master-nightly-4...  Overall  False
2  2021-03-13 20:51:32  "redhat-assisted-installer"  periodic-ci-openshift-release-master-nightly-4...  Overall  False
3  2021-03-13 07:51:20  "redhat-assisted-installer"  periodic-ci-openshift-release-master-nightly-4...  Overall  False
4  2021-03-13 06:43:20  "redhat-assisted-installer"  periodic-ci-openshift-release-master-nightly-4...  Overall  False

Flake Severity Metric

Some tests flake in a large share of their builds and therefore have a high flake rate, or severity. Flake severity is the fraction of a test's runs that are flagged as flakes, and it can be used to prioritize the work needed to resolve flaky tests.
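To make the definition concrete, here is a small worked example on invented data. It reports the flake severity of each test together with the number of runs and flakes behind it, since a severity computed from very few runs is less informative. The test names and values are hypothetical.

```python
import pandas as pd

# Invented runs: test-x flakes in 2 of 4 runs, test-y in 0 of 2 runs.
toy = pd.DataFrame(
    {
        "test": ["test-x"] * 4 + ["test-y"] * 2,
        "flake": [True, False, True, False, False, False],
    }
)

# The mean of a boolean column is the fraction of True values,
# so "mean" gives the flake severity directly.
summary = (
    toy.groupby("test")["flake"]
    .agg(flake_severity="mean", runs="count", flakes="sum")
    .reset_index()
    .sort_values("flake_severity", ascending=False)
)
print(summary)
```

Sorting by `flake_severity` puts the tests that flake most often at the top, which is one simple way to order them for follow-up.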

[7]
df_flake_severity = df[["test", "flake"]]
df_flake_severity.describe()
        test      flake
count   27192485  27192485
unique  15157     2
top     Overall   False
freq    37799     27142857
[8]
## The following metric implements flake severity,
## or flake rate by test.
## Moving forward, this will be aggregated in Superset.
## For the sake of completeness, it is implemented here.
flake_severity = df.groupby("test").flake.mean().reset_index()
flake_severity.rename(columns={"flake": "flake_severity"}, inplace=True)
flake_severity
       test                                               flake_severity
0      Add Secret to Workloads.Add Secret to Workload...  0.0
1      Add Secret to Workloads.Add Secret to Workload...  0.0
2      Alertmanager: Configuration.creates a receiver...  0.0
3      Alertmanager: Configuration.deletes a receiver...  0.0
4      Alertmanager: Configuration.displays the Alert...  0.0
...    ...                                                ...
15152  user.openshift.io~v1~Group.Kubernetes resource...  0.0
15153  user.openshift.io~v1~Group.Kubernetes resource...  0.0
15154  user.openshift.io~v1~Group.Kubernetes resource...  0.0
15155  user.openshift.io~v1~Group.Kubernetes resource...  0.0
15156  user.openshift.io~v1~Group.Kubernetes resource...  0.0

15157 rows × 2 columns

Visualization

  • Here, we provide a quick visualization of the computed metric.
[9]
sns.set(rc={"figure.figsize": (15, 10)})
flake_severity.hist()
plt.ylabel("Tests")
plt.xlabel("Flake Severity")
plt.show()

From the above graph we can conclude that most of the tests have a very low (~0) flake severity. One might look at this graph and assume that almost all of the tests have a flake severity of exactly 0. To get more clarity, the next graph uses more bins and a truncated y-axis, which narrows the set of tests to investigate for probable flakiness from about 8,000 to fewer than 100.

[10]
sns.set(rc={"figure.figsize": (15, 10)})
flake_severity.hist(bins=50)
plt.ylabel("Tests")
plt.xlabel("Flake Severity")
plt.ylim((0, 25))
plt.show()

The above graph shows that some tests do have a nonzero flake severity, although they make up only a small group. Let's look at some of the tests with a high flake severity score.
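The same observation can be checked numerically instead of being read off a histogram. The sketch below counts how many tests have a nonzero severity and groups them into coarse severity ranges. The toy severities and the bin edges are arbitrary choices for illustration, not values from this dataset.

```python
import pandas as pd

# Invented severities for five hypothetical tests.
toy_severity = pd.DataFrame(
    {
        "test": ["test-a", "test-b", "test-c", "test-d", "test-e"],
        "flake_severity": [0.0, 0.0, 0.0, 0.02, 0.41],
    }
)

# Tests that flaked at least once.
nonzero = toy_severity[toy_severity["flake_severity"] > 0]
share_nonzero = len(nonzero) / len(toy_severity)

# Number of tests per severity range; the bin edges are an arbitrary choice.
buckets = (
    pd.cut(
        toy_severity["flake_severity"],
        bins=[0, 0.01, 0.1, 0.5, 1.0],
        include_lowest=True,
    )
    .value_counts()
    .sort_index()
)

print(f"{len(nonzero)} of {len(toy_severity)} tests have nonzero severity")
print(buckets)
```

On the real data, the same filter could be applied to the `flake_severity` dataframe computed earlier to get an exact count of tests worth investigating.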

The top 5 tests with the highest flake severity:

[11]
flake_severity.nlargest(5, "flake_severity")
      test                                               flake_severity
5793  openshift-tests.[sig-arch] Monitor cluster whi...  0.536148
230   Cluster upgrade.[sig-imageregistry] Image regi...  0.410038
234   Cluster upgrade.[sig-network-edge] Application...  0.404112
236   Cluster upgrade.[sig-network-edge] Cluster fro...  0.386168
203   Cluster upgrade.[sig-api-machinery] Kubernetes...  0.279356
[12]
# Overall flake rate (fraction of all test runs flagged as flakes)
df.flake.sum() / df.flake.count()
0.001825063064298831

Save results to Ceph or locally

  • Use the following helper functions to save the data frame in Parquet format to the Ceph bucket if we are running in automation, and locally if not.
[13]
filename = f"{METRIC_NAME}-{timestamp.year}-{timestamp.month}-{timestamp.day}.parquet"

if AUTOMATION == "True":
    cc = CephCommunication(s3_endpoint_url, s3_access_key, s3_secret_key, s3_bucket)
    cc.upload_to_ceph(df, metric_path, filename)
else:
    save_to_disk(df, OUTPUT_DATA_PATH, filename)
[14]
## Sanity check to see if the dataset is the same
if AUTOMATION == "True":
    sanity_check = cc.read_from_ceph(metric_path, filename)
else:
    sanity_check = read_from_disk(OUTPUT_DATA_PATH, filename)

sanity_check
          timestamp            tab                          grid                                               test                                               flake
0         2021-03-15 23:40:20  "redhat-assisted-installer"  periodic-ci-openshift-release-master-nightly-4...  Overall                                            False
1         2021-03-15 00:01:06  "redhat-assisted-installer"  periodic-ci-openshift-release-master-nightly-4...  Overall                                            False
2         2021-03-13 20:51:32  "redhat-assisted-installer"  periodic-ci-openshift-release-master-nightly-4...  Overall                                            False
3         2021-03-13 07:51:20  "redhat-assisted-installer"  periodic-ci-openshift-release-master-nightly-4...  Overall                                            False
4         2021-03-13 06:43:20  "redhat-assisted-installer"  periodic-ci-openshift-release-master-nightly-4...  Overall                                            False
...       ...                  ...                          ...                                                ...                                                ...
27192480  2021-03-14 00:01:00  "redhat-single-node"         periodic-ci-openshift-release-master-nightly-4...  openshift-tests.[sig-arch] Monitor cluster whi...  True
27192481  2021-03-13 00:01:07  "redhat-single-node"         periodic-ci-openshift-release-master-nightly-4...  openshift-tests.[sig-arch] Monitor cluster whi...  True
27192482  2021-03-12 04:22:20  "redhat-single-node"         periodic-ci-openshift-release-master-nightly-4...  openshift-tests.[sig-arch] Monitor cluster whi...  True
27192483  2021-03-11 00:01:18  "redhat-single-node"         periodic-ci-openshift-release-master-nightly-4...  openshift-tests.[sig-arch] Monitor cluster whi...  False
27192484  2021-03-10 00:01:03  "redhat-single-node"         periodic-ci-openshift-release-master-nightly-4...  openshift-tests.[sig-arch] Monitor cluster whi...  False

27192485 rows × 5 columns
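The sanity check above displays the reloaded dataframe for visual inspection. If an explicit comparison is wanted, one possible approach is sketched below using `pd.testing.assert_frame_equal`. The `frames_match` helper is hypothetical and is not part of `metric_template`; the toy frames are invented.

```python
import pandas as pd


def frames_match(original, reloaded):
    """Return True if two dataframes have identical columns, dtypes, and values."""
    try:
        pd.testing.assert_frame_equal(
            original.reset_index(drop=True),
            reloaded.reset_index(drop=True),
        )
    except AssertionError:
        return False
    return True


# Invented frames standing in for the saved and reloaded data.
saved = pd.DataFrame({"test": ["test-x", "test-y"], "flake": [True, False]})
same = saved.copy()
different = pd.DataFrame({"test": ["test-x", "test-y"], "flake": [True, True]})

print(frames_match(saved, same))
print(frames_match(saved, different))
```

In this notebook the call would be `frames_match(df, sanity_check)`. Note that the comparison is strict about dtypes, so a column whose type changes during the save and reload would be reported as a mismatch.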

Conclusion

This notebook computed the number of flakes and the flake severity metric. The dataframe saved on Ceph can be used to generate aggregated views and visualizations.