
Prow Logs and GCS Data

What do we have access to as data scientists when digging into the build artifacts?

In this notebook we will demonstrate how to discover and interact with the data (logs) made available on GCS/origin-ci-test, and provide some simple EDA to help folks get started analyzing this data.

This notebook is divided into 2 sections:

  1. Compare the different log files present throughout the archives and quantify how complete and comparable our log dataset is from build to build.
  2. Download a sample dataset of the events and build logs to perform some light EDA.

Note: We will be collecting data from the "origin-ci-test" Bucket on Google Cloud Storage. However, some out-of-notebook exploration has made it apparent that this bucket holds a massive amount of data, containing far more than just the OpenShift CI logs we are interested in here, and investigating it programmatically is not advised. Therefore, we recommend using the web UI to inspect what jobs are exposed and to identify what is of interest to your analysis before collecting data via the Google Cloud Storage API. Here we will rely on web-scraping the UI to explore what's available to us based on which jobs are displayed on TestGrid.
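For example, one way to enumerate the jobs shown on a TestGrid dashboard is to query its summary endpoint. This is a minimal sketch, assuming TestGrid's public "/summary" JSON endpoint (it is not used elsewhere in this notebook):

import requests

# Hedged sketch: TestGrid dashboards expose a JSON summary keyed by tab (job) name.
# The dashboard name below is the one this notebook works with.
dashboard = "redhat-openshift-ocp-release-4.6-informing"
resp = requests.get(f"https://testgrid.k8s.io/{dashboard}/summary")
for job_name in resp.json():
    print(job_name)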

Compare availability of log files across builds

[1]
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
from google.cloud import storage
from pathlib import Path
from IPython.display import Markdown

Example to access a single set of Prow artifacts

Let's make sure we understand how this works, and focus on a single job first.

[2]
tab = '"redhat-openshift-ocp-release-4.6-informing"'
job = "periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade"
[3]
response = requests.get(
    f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{job}"
)
soup = BeautifulSoup(response.text, "html.parser")
# Each link's text looks like " <name>/"; strip the leading space and trailing
# slash, then drop the first ("..") and last (non-build) entries.
list_of_builds = [x.get_text()[1:-1] for x in soup.find_all("a")][1:-1]
[4]
response = requests.get(
    f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{job}/{list_of_builds[1]}"
)
response.url
'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade/1383601003814719488'
[5]
soup = BeautifulSoup(response.text, "html.parser")
[6]
[x.get_text() for x in soup.find_all("a")]
[' ..',
 ' artifacts/',
 ' build-log.txt',
 ' finished.json',
 ' podinfo.json',
 ' prowjob.json',
 ' started.json']

Great, we can now programmatically access the archives. Next, let's walk through all of the build archives for a single job and record what each one has at the first level of its directory.

[7]
build_data = {}

for build in list_of_builds:
    response = requests.get(
        f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{job}/{build}"
    )
    soup = BeautifulSoup(response.text, "html.parser")
    artifacts = [x.get_text() for x in soup.find_all("a")]
    build_data[build] = artifacts
[8]
builds_info = pd.Series({k: len(v) for (k, v) in build_data.items()})
[9]
builds_info.value_counts()
7    536
dtype: int64
[10]
pd.Series(build_data).apply(" ".join).value_counts()
 ..  artifacts/  build-log.txt  finished.json  podinfo.json  prowjob.json  started.json    536
dtype: int64
[11]
cent1 = builds_info.value_counts() / len(builds_info)
cent1 = cent1.to_frame().reset_index()
cent1
index 0
0 7 1.0
[12]
Markdown(
    "{}% of our records for this job appear to be complete and include the 'artifacts/' subdirectory;\
 let's dig in and see what they contain.".format(
        round(cent1.iloc[0, 1] * 100, 2)
    )
)

100.0% of our records for this job appear to be complete and include the 'artifacts/' subdirectory; let's dig in and see what they contain.

[13]
build_data = {}

for build in list_of_builds:
    response = requests.get(
        f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{job}/{build}/artifacts"
    )
    soup = BeautifulSoup(response.text, "html.parser")
    artifacts = [x.get_text() for x in soup.find_all("a")]
    build_data[build] = artifacts
[14]
artifact_info = pd.Series({k: len(v) for (k, v) in build_data.items()})
artifact_info.value_counts()
8    528
7      4
3      2
6      1
5      1
dtype: int64
[15]
cent2 = artifact_info.value_counts() / len(artifact_info)
cent2 = cent2.to_frame().reset_index()
cent2
index 0
0 8 0.985075
1 7 0.007463
2 3 0.003731
3 6 0.001866
4 5 0.001866
[16]
Markdown(
    "The above shows us that about {}% of the artifacts dirs "
    "have {} items, {}% have {} items, and so on "
    "(but it does not account for different combinations).".format(
        round(cent2.iloc[0, 1] * 100, 2),
        cent2.iloc[0, 0],
        round(cent2.iloc[1, 1] * 100, 2),
        cent2.iloc[1, 0],
    )
)

The above shows us that about 98.51% of the artifacts dirs have 8 items, 0.75% have 7 items, and so on (but it does not account for different combinations).

[17]
pd.Series(build_data).apply(" ".join).value_counts(normalize=True)
 ..  build-resources/  e2e-gcp-upgrade/  release/  ci-operator-step-graph.json  ci-operator.log  junit_operator.xml  metadata.json    0.985075
 ..  build-resources/  release/  ci-operator-step-graph.json  ci-operator.log  junit_operator.xml  metadata.json                      0.007463
 ..  ci-operator.log  junit_job.xml                                                                                                   0.003731
 ..  build-resources/  ci-operator-step-graph.json  ci-operator.log  junit_operator.xml  metadata.json                                0.001866
 ..  e2e-gcp-upgrade/  release/  ci-operator.log  metadata.json                                                                       0.001866
dtype: float64

We can see from the results above that once we get down into the artifacts there is far less uniformity in the data available to us for analysis. And this is all within a single job! Moving forward we will assume that this issue gets worse when comparing available artifacts across jobs, and we can dedicate a later notebook to proving out that assumption.

This heterogeneity of objects available for each build will make it somewhat difficult to use these sets of documents as a whole to compare different CI behaviour. At this point, it makes sense to consider looking only at the same document (log) across jobs where it is available.
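Before moving on, we can check this directly: a small sketch that intersects the artifact listings in the build_data dict collected above to find the names present in every build of this job.

# Artifact names common to every build's artifacts/ directory.
common_artifacts = set.intersection(*(set(v) for v in build_data.values()))
print(sorted(common_artifacts))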

Collect Data

Build logs

In the next section we are going to walk through accessing the build-log.txt and the events.json files, as they appear to be nearly universally available. We will both download a small testing dataset and show how to work directly with the data in memory.

Now that we know which logs we want to collect, it's simpler to use the Google Cloud Storage API to access our data.

[18]
def connect_storage(bucket_name):
    """Creates an anonymous GCS client and returns the named public bucket."""
    storage_client = storage.Client.create_anonymous_client()
    bucket = storage_client.bucket(bucket_name)
    return {"bucket": bucket, "storage_client": storage_client}


def download_public_file(client, source_blob_name):
    """Downloads a public blob from the bucket."""
    blob = client["bucket"].blob(source_blob_name)
    if blob.exists(client["storage_client"]):
        text = blob.download_as_text()
    else:
        text = ""
    return text
[19]
bucket_connection = connect_storage("origin-ci-test")
[20]
# Read data into memory
build_log_data = {}
for build in list_of_builds:
    file = download_public_file(bucket_connection, f"logs/{job}/{build}/build-log.txt")
    build_log_data[build] = file
[21]
build_log_data[list(build_log_data.keys())[0]]
'\x1b[36mINFO\x1b[0m[2021-04-17T19:58:56Z] ci-operator version v20210416-d902916        \n\x1b[36mINFO\x1b[0m[2021-04-17T19:58:56Z] Loading configuration from https://config.ci.openshift.org for openshift/release@master [ci-4.6-upgrade-from-stable-4.5] \n\x1b[36mINFO\x1b[0m[2021-04-17T19:58:57Z] Resolved SHA missing for master in https://github.com/openshift/release (will prevent caching) \n\x1b[36mINFO\x1b[0m[2021-04-17T19:58:57Z] Using explicitly provided pull-spec for release initial (registry.ci.openshift.org/ocp/release:4.5.37) \n\x1b[36mINFO\x1b[0m[2021-04-17T19:58:57Z] Using explicitly provided pull-spec for release latest (registry.ci.openshift.org/ocp/release:4.6.0-0.ci-2021-04-17-163254) \n\x1b[36mINFO\x1b[0m[2021-04-17T19:58:57Z] Using namespace https://console.build02.ci.openshift.org/k8s/cluster/projects/ci-op-y3cf82zn \n\x1b[36mINFO\x1b[0m[2021-04-17T19:58:57Z] Running [input:origin-centos-8], [input:ocp-4.5-upi-installer], [release:initial], [release:latest], [images], e2e-gcp-upgrade \n\x1b[36mINFO\x1b[0m[2021-04-17T19:58:58Z] Tagging origin/centos:8 into pipeline:origin-centos-8. \n\x1b[36mINFO\x1b[0m[2021-04-17T19:58:58Z] Tagging ocp/4.5:upi-installer into pipeline:ocp-4.5-upi-installer. \n\x1b[36mINFO\x1b[0m[2021-04-17T19:58:59Z] Importing release image initial.             \n\x1b[36mINFO\x1b[0m[2021-04-17T19:58:59Z] Importing release image latest.              \n\x1b[36mINFO\x1b[0m[2021-04-17T20:01:08Z] Imported release 4.5.37 created at 2021-03-31 06:05:56 +0000 UTC with 108 images to tag release:initial \n\x1b[36mINFO\x1b[0m[2021-04-17T20:01:16Z] Imported release 4.6.0-0.ci-2021-04-17-163254 created at 2021-04-17 16:33:19 +0000 UTC with 121 images to tag release:latest \n\x1b[36mINFO\x1b[0m[2021-04-17T20:01:16Z] Acquiring leases for test e2e-gcp-upgrade    \n\x1b[36mINFO\x1b[0m[2021-04-17T20:01:16Z] Running multi-stage test e2e-gcp-upgrade     \n\x1b[36mINFO\x1b[0m[2021-04-17T20:01:17Z] Running step e2e-gcp-upgrade-ipi-install-hosted-loki. \n\x1b[36mINFO\x1b[0m[2021-04-17T20:01:27Z] Step e2e-gcp-upgrade-ipi-install-hosted-loki succeeded after 10.064135461s. \n\x1b[36mINFO\x1b[0m[2021-04-17T20:01:27Z] Running step e2e-gcp-upgrade-ipi-conf.       \n\x1b[36mINFO\x1b[0m[2021-04-17T20:01:47Z] Step e2e-gcp-upgrade-ipi-conf succeeded after 20.05838057s. \n\x1b[36mINFO\x1b[0m[2021-04-17T20:01:47Z] Running step e2e-gcp-upgrade-ipi-conf-gcp.   \n\x1b[36mINFO\x1b[0m[2021-04-17T20:01:57Z] Step e2e-gcp-upgrade-ipi-conf-gcp succeeded after 10.065333292s. \n\x1b[36mINFO\x1b[0m[2021-04-17T20:01:57Z] Running step e2e-gcp-upgrade-ipi-install-monitoringpvc. \n\x1b[36mINFO\x1b[0m[2021-04-17T20:02:07Z] Step e2e-gcp-upgrade-ipi-install-monitoringpvc succeeded after 10.071902479s. \n\x1b[36mINFO\x1b[0m[2021-04-17T20:02:07Z] Running step e2e-gcp-upgrade-ipi-install-rbac. \n\x1b[36mINFO\x1b[0m[2021-04-17T20:02:17Z] Step e2e-gcp-upgrade-ipi-install-rbac succeeded after 10.051469409s. \n\x1b[36mINFO\x1b[0m[2021-04-17T20:02:17Z] Running step e2e-gcp-upgrade-ipi-install-install-stableinitial. 
\n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] Logs for container test in pod e2e-gcp-upgrade-ipi-install-install-stableinitial: \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] Installing from release registry.build02.ci.openshift.org/ci-op-y3cf82zn/release@sha256:037ed15c79bbd6bee6f0d524d91790d2b368cba945e7ae6d64c984dc0f162b82 \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Credentials loaded from environment variable \\"GOOGLE_CLOUD_KEYFILE_JSON\\", file \\"/var/run/secrets/ci.openshift.io/cluster-profile/gce.json\\"" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Consuming Install Config from target directory" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] Will include manifests:                      \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_metrics_rb.yml          \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_creds.yml               \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_ds.yml                  \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_cm.yml                  \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_clusterrole.yml         \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_oauth_role.yml          \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_metrics_role.yml        \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_promtail_service.yml    \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_grafanacom_creds.yml    \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_sa.yml                  \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_promtail_cookie_secret.yml \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_role.yml                \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_cluster-monitoring-pvc.yml \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_psp.yml                 \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_01_ns.yml               \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_oauth_clusterrolebinding.yml \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_metrics.yml             \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_rolebinding.yml         \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] /tmp/secret/manifest_clusterrolebinding.yml  \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=warning msg="Found override for release image. Please be warned, this is not advised" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Credentials loaded from environment variable \\"GOOGLE_CLOUD_KEYFILE_JSON\\", file \\"/var/run/secrets/ci.openshift.io/cluster-profile/gce.json\\"" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Consuming Openshift Manifests from target directory" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Consuming Master Machines from target directory" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Consuming OpenShift Install (Manifests) from target directory" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Consuming Worker Machines from target directory" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Consuming Common Manifests from target directory" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Creating infrastructure resources..." 
\n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Waiting up to 20m0s for the Kubernetes API at https://api.ci-op-y3cf82zn-5a3d5.origin-ci-int-gce.dev.openshift.com:6443..." \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="API v1.18.3+cdb0358 up"      \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Waiting up to 40m0s for bootstrapping to complete..." \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=error msg="Cluster operator authentication Degraded is True with ConfigObservation_Error::IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \\"v4-0-config-system-router-certs\\" not found\\nConfigObservationDegraded: secret \\"v4-0-config-system-router-certs\\" not found\\nIngressStateEndpointsDegraded: No endpoints found for oauth-server\\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.ci-op-y3cf82zn-5a3d5.origin-ci-int-gce.dev.openshift.com: []" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Cluster operator authentication Progressing is Unknown with NoData: " \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Cluster operator authentication Available is Unknown with NoData: " \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Cluster operator csi-snapshot-controller Progressing is True with _AsExpected: Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Cluster operator csi-snapshot-controller Available is False with _AsExpected: Available: Waiting for Deployment to deploy csi-snapshot-controller pods" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.5.37" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=error msg="Cluster operator machine-config Degraded is True with RenderConfigFailed: Unable to apply 4.5.37: openshift-config-managed/kube-cloud-config configmap is required on platform GCP but not found: configmap \\"kube-cloud-config\\" not found" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.5.37" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Cluster operator openshift-apiserver Available is False with APIServices_Error: APIServicesAvailable: \\"apps.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: \\"authorization.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: \\"build.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: \\"image.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: \\"oauth.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: \\"project.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: 
\\"quota.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: \\"route.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: \\"security.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: \\"template.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: \\"user.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Cluster operator openshift-controller-manager Progressing is True with _DesiredStateNotYetAchieved: Progressing: daemonset/controller-manager: observed generation is 0, desired generation is 8.\\nProgressing: daemonset/controller-manager: number available is 0, desired number available > 1" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Cluster operator openshift-controller-manager Available is False with _NoPodsAvailable: Available: no daemon pods available on any node." \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Cluster operator operator-lifecycle-manager-packageserver Available is False with : " \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.15.1" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Pulling debug logs from the bootstrap machine" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=info msg="Bootstrap gather logs captured here \\"/tmp/installer/log-bundle-20210417204954.tar.gz\\"" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] level=fatal msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition" \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] Setup phase finished, prepare env for next steps \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] Copying log bundle...                        \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] Removing REDACTED info from log...           \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] Copying required artifacts to shared dir     \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] {"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"prow/entrypoint/run.go:80","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2021-04-17T20:50:28Z"} \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] error: failed to execute wrapped command: exit status 1 \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z]                                              \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] Step e2e-gcp-upgrade-ipi-install-install-stableinitial failed after 48m20.074374705s. \n\x1b[36mINFO\x1b[0m[2021-04-17T20:50:37Z] Running step e2e-gcp-upgrade-gather-gcp-console. \n\x1b[36mINFO\x1b[0m[2021-04-17T20:51:07Z] Step e2e-gcp-upgrade-gather-gcp-console succeeded after 30.058654549s. \n\x1b[36mINFO\x1b[0m[2021-04-17T20:51:07Z] Running step e2e-gcp-upgrade-gather-must-gather. \n\x1b[36mINFO\x1b[0m[2021-04-17T21:10:07Z] Step e2e-gcp-upgrade-gather-must-gather succeeded after 19m0.055587687s. \n\x1b[36mINFO\x1b[0m[2021-04-17T21:10:07Z] Running step e2e-gcp-upgrade-gather-extra.   \n\x1b[36mINFO\x1b[0m[2021-04-17T21:11:07Z] Step e2e-gcp-upgrade-gather-extra succeeded after 1m0.064604654s. 
\n\x1b[36mINFO\x1b[0m[2021-04-17T21:11:07Z] Running step e2e-gcp-upgrade-gather-audit-logs. \n\x1b[36mINFO\x1b[0m[2021-04-17T21:11:57Z] Step e2e-gcp-upgrade-gather-audit-logs succeeded after 50.057797698s. \n\x1b[36mINFO\x1b[0m[2021-04-17T21:11:57Z] Running step e2e-gcp-upgrade-ipi-deprovision-deprovision. \n\x1b[36mINFO\x1b[0m[2021-04-17T21:14:47Z] Step e2e-gcp-upgrade-ipi-deprovision-deprovision succeeded after 2m50.060881266s. \n\x1b[36mINFO\x1b[0m[2021-04-17T21:14:47Z] Releasing leases for test e2e-gcp-upgrade    \n\x1b[36mINFO\x1b[0m[2021-04-17T21:14:48Z] Ran for 1h15m51s                             \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] Some steps failed:                           \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z]   * could not run steps: step e2e-gcp-upgrade failed: "e2e-gcp-upgrade" pre steps failed: "e2e-gcp-upgrade" pod "e2e-gcp-upgrade-ipi-install-install-stableinitial" failed: the pod ci-op-y3cf82zn/e2e-gcp-upgrade-ipi-install-install-stableinitial failed after 48m15s (failed containers: test): ContainerFailed one or more containers exited \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] Container test exited with code 1, reason Error \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] ---                                          \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] \\"quota.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: \\"route.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: \\"security.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: \\"template.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)\\nAPIServicesAvailable: \\"user.openshift.io.v1\\" is not ready: 503 (the server is currently unable to handle the request)" \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] level=info msg="Cluster operator openshift-controller-manager Progressing is True with _DesiredStateNotYetAchieved: Progressing: daemonset/controller-manager: observed generation is 0, desired generation is 8.\\nProgressing: daemonset/controller-manager: number available is 0, desired number available > 1" \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] level=info msg="Cluster operator openshift-controller-manager Available is False with _NoPodsAvailable: Available: no daemon pods available on any node." \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] level=info msg="Cluster operator operator-lifecycle-manager-packageserver Available is False with : " \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] level=info msg="Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.15.1" \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] level=info msg="Pulling debug logs from the bootstrap machine" \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] level=info msg="Bootstrap gather logs captured here \\"/tmp/installer/log-bundle-20210417204954.tar.gz\\"" \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] level=fatal msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition" \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] Setup phase finished, prepare env for next steps \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] Copying log bundle...                        \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] Removing REDACTED info from log...           
\n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] Copying required artifacts to shared dir     \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] {"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"prow/entrypoint/run.go:80","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2021-04-17T20:50:28Z"} \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] error: failed to execute wrapped command: exit status 1 \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] ---                                          \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] Link to step on registry info site: https://steps.ci.openshift.org/reference/ipi-install-install-stableinitial \n\x1b[31mERRO\x1b[0m[2021-04-17T21:14:48Z] Link to job on registry info site: https://steps.ci.openshift.org/job?org=openshift&repo=release&branch=master&test=e2e-gcp-upgrade&variant=ci-4.6-upgrade-from-stable-4.5 \n'
[22]
def get_counts(x):
    """
    Gets counts for chars, words, lines for a log.
    """
    if x:
        chars = len(x)
        words = len(x.split())
        lines = x.count("\n") + 1
        return chars, words, lines
    else:
        return 0, 0, 0
[23]
## Create a dataframe with char, word, and line
## counts for the logs
data = []
for key, value in build_log_data.items():
    chars, words, lines = get_counts(value)
    data.append([key, chars, words, lines])

df = pd.DataFrame(data=data, columns=["build_log_id", "chars", "words", "lines"])
df
build_log_id chars words lines
0 1383510269677801472 17065 1188 120
1 1383601003814719488 17063 1189 120
2 1383691728313126912 17063 1189 120
3 1383782446109036544 37601 2690 261
4 1383873157873537024 16784 1167 120
... ... ... ... ...
531 1447844674805960704 20401 1485 223
532 1448025872249393152 20395 1480 222
533 1448207267424899072 20353 1477 223
534 1448388543654137856 20443 1484 222
535 1448569878565883904 20458 1486 222

536 rows × 4 columns

See the stats for chars, words, lines

[24]
df["chars"].describe()
count    5.360000e+02
mean     3.516793e+04
std      2.711030e+05
min      2.120000e+03
25%      2.041650e+04
50%      2.052900e+04
75%      2.809525e+04
max      6.289343e+06
Name: chars, dtype: float64
[25]
df["words"].describe()
count       536.000000
mean       2382.583955
std       19301.541247
min         147.000000
25%        1482.000000
50%        1494.000000
75%        1648.000000
max      447981.000000
Name: words, dtype: float64
[26]
describe = df["lines"].describe()
describe = describe.to_frame()
describe
lines
count 536.000000
mean 249.042910
std 815.830161
min 19.000000
25% 215.000000
50% 222.000000
75% 222.000000
max 19034.000000
[27]
Markdown(
    "From the initial analysis above, we see that we have log files "
    "ranging from {} lines to {} lines, with a mean of ~{} lines. "
    "This suggests high variability. The next thing we could look "
    "at would be the similarity between log files, performing "
    "analysis such as templating and clustering. "
    "We will address those questions in an upcoming notebook.".format(
        round(describe.iloc[3, 0]),
        round(describe.iloc[7, 0]),
        round(describe.iloc[1, 0]),
    )
)

From the initial analysis above, we see that we have log files ranging from 19 lines to 19034 lines, with a mean of ~249 lines. This suggests high variability. The next thing we could look at would be the similarity between log files, performing analysis such as templating and clustering. We will address those questions in an upcoming notebook.
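As an illustrative sketch of that similarity question (not necessarily the approach the upcoming notebook will take), one could start with pairwise Jaccard similarity over the word sets of two build logs:

def jaccard(log_a, log_b):
    """Jaccard similarity between the word sets of two logs."""
    words_a, words_b = set(log_a.split()), set(log_b.split())
    if not (words_a or words_b):
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


ids = list(build_log_data.keys())
print(jaccard(build_log_data[ids[0]], build_log_data[ids[1]]))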

Events

[28]
build_events_data = {}
for build in list_of_builds:
    file = download_public_file(
        bucket_connection, f"logs/{job}/{build}/artifacts/build-resources/events.json"
    )
    if file:
        build_events_data[build] = json.loads(file)
    else:
        build_events_data[build] = ""
[29]
## Percentage of builds that have the events.json file
count = 0
for key, value in build_events_data.items():
    if value:
        count += 1
percent = count * 100 / len(build_events_data)
print(percent)
99.44029850746269
[30]
# Analyzing the messages of a single build
single_build = sorted(build_events_data.keys())[0]
messages = [
    (i["metadata"]["uid"], i["message"])
    for i in build_events_data[single_build]["items"]
]
messages_df = pd.DataFrame(messages, columns=["UID", "message"])
messages_df
UID message
0 c694dae6-d31b-469c-b9ca-e139d20ef644 Running job periodic-ci-openshift-release-mast...
1 8174d743-feb7-4446-84f5-def97617fd3c Running job periodic-ci-openshift-release-mast...
2 e7545ab7-fa0a-473f-b719-4dbad5e01a65 created SCC ranges
3 8a49821c-3ef6-4f71-ae64-a3faad6d8365 No matching pods found
4 b97c3b80-4e2f-4e5e-8e4a-5e890c3cc4fd found no controller ref for pod "release-image...
... ... ...
219 6e002c16-f8cc-4390-82cb-a6e7f521541e Created container release
220 e0592c0a-de8b-4ab8-9aee-5fe0384dd51f Started container release
221 946cf37e-9ce7-436a-8997-cd88a3ffeea8 Container image "gcr.io/k8s-prow/sidecar:v2021...
222 90fa202b-5ac5-4768-86ad-54767e6579d4 Created container sidecar
223 9d81659f-c5a9-4446-b86e-ddcc9e590dec Started container sidecar

224 rows × 2 columns

[31]
messages_df["message"].describe()
count                           224
unique                           89
top       Created container sidecar
freq                             13
Name: message, dtype: object
[32]
messages_df["message"].value_counts().reset_index()
index message
0 Created container sidecar 13
1 Started container place-entrypoint 13
2 Started container sidecar 13
3 Container image "gcr.io/k8s-prow/sidecar:v2021... 13
4 Container image "gcr.io/k8s-prow/entrypoint:v2... 13
... ... ...
84 Add eth0 [10.130.47.13/23] 1
85 Successfully assigned ci-op-y3cf82zn/e2e-gcp-u... 1
86 Successfully pulled image "image-registry.open... 1
87 Pulling image "image-registry.openshift-image-... 1
88 Successfully pulled image "image-registry.open... 1

89 rows × 2 columns

[33]
Markdown(
    "In the build data, we saw that about {}% of builds have the events.json file."
    " We further analyzed all the events that happened for a particular build"
    " and found the frequencies of the messages. We can repeat the process for"
    " all the other builds to find the most common messages and perform further"
    " analysis.".format(round(percent, 2))
)

In the build data, we saw that about 99.44% of builds have the events.json file. We further analyzed all the events that happened for a particular build and found the frequencies of the messages. We can repeat the process for all the other builds to find the most common messages and perform further analysis.
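As a sketch of that repetition, we can pool the messages from every build that returned an events.json and look at the most common messages overall (this assumes the same items/message structure seen above):

# Pool event messages across all builds that have an events.json.
all_messages = [
    item["message"]
    for events in build_events_data.values()
    if events  # skip builds where the file was missing
    for item in events["items"]
]
pd.Series(all_messages).value_counts().head(10)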

Save sample data

[34]
path = "../../../data/raw/gcs/build-logs/"
filename = "sample-build-logs.parquet"
dataset_base_path = Path(path)
dataset_base_path.mkdir(parents=True, exist_ok=True)
build_logs = pd.DataFrame.from_dict(build_log_data, orient="index", columns=["log"])
build_logs.to_parquet(f"{path}/{filename}")
[35]
path = "../../../data/raw/gcs/events/"
filename = "sample-events.json"
dataset_base_path = Path(path)
dataset_base_path.mkdir(parents=True, exist_ok=True)

with open(f"{path}/{filename}", "w") as file:
    json.dump(build_events_data, file)
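To sanity-check the saved samples (and to reuse them in later notebooks), they can be read back like so:

# Reload the sample datasets saved above.
build_logs = pd.read_parquet("../../../data/raw/gcs/build-logs/sample-build-logs.parquet")
with open("../../../data/raw/gcs/events/sample-events.json") as f:
    sample_events = json.load(f)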

Conclusion

In this notebook, we demonstrated how to programmatically access the OpenShift origin-ci-test archives on GCS, pull specific log types into our notebook for analysis, and save them for later use.