
openshift-ci: build logs initial data collection and EDA

Automated log analysis normally relies on certain traits that typical logs exhibit:

  • Similar messages are repeated over and over.
  • Messages are short, and contain mixed text and parameters.
  • Log files contain some structure (time stamps, severity levels, system/component identifiers) as well as free form textual data.

Log files from OpenShift CI do not always have these characteristics. A single log file can contain a wide diversity of log entries, some of which can be long multi-line sequences (e.g. command output) that do not always share the same structure. Moreover, the content type, structure, and format vary between jobs, and even between executions of the same job over time.

On the other hand, CI logs offer other characteristics that traditional logs often don't:

  • Diversity of content (as explained above).
  • They have a beginning and an end.
  • We have repetition at the whole-log-file level: the same job runs multiple times, which enables us to compare multiple versions of the same log file (in contrast with traditional log analysis, which can exploit repetition only at the log-entry level); see the sketch after this list.
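
For example (an illustrative sketch, not part of the original notebook; the file paths are hypothetical), that whole-file repetition lets us quantify how similar two runs of the same job are:

import difflib

def log_similarity(path_a, path_b):
    """Line-level similarity ratio (0 to 1) between two build logs."""
    with open(path_a) as f:
        lines_a = f.readlines()
    with open(path_b) as f:
        lines_b = f.readlines()
    return difflib.SequenceMatcher(None, lines_a, lines_b).ratio()

# e.g. log_similarity("logs/1234.txt", "logs/5678.txt")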

Different jobs in OpenShift CI provide different log files and outputs, but there is one specific log output that they all have in common: the job's build log (build-log.txt). We start the log exploration with this type of log.

Download build log data

We start by accessing job log data. Initially we will get:

  • Build logs (build-log.txt) for jobs

  • Basic metadata about each build log:

    • Size (length) of the build log file
    • Duration of the job run
    • The result of the job, i.e. whether the log corresponds to a failed or passed execution

All that information is available in a Google Storage bucket called origin-ci-test (the bucket used by OpenShift CI). Within that bucket, each CI job definition has matching prefixes for its generated artifacts.

For example, all the builds for the job release-openshift-ocp-installer-e2e-aws-serial-4.7 have the form:

/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/BUILD_ID/

and directly under that prefix we will always find 3 blobs that provide the information we need:

  • build-log.txt, the build log itself
  • started.json, which contains the start timestamp
  • finished.json (present once the job has finished), which contains the end timestamp and the job result (success, failure, or error)

You can browse an example build of that job in the Prow UI.
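
As a minimal sketch of the layout described above (not part of the original notebook; BUILD_ID is a placeholder to substitute with a real build ID), the three blobs can be fetched with the anonymous GCS client:

import json
from google.cloud import storage

client = storage.Client.create_anonymous_client()
bucket = client.bucket("origin-ci-test")

# BUILD_ID is a placeholder; substitute a real build ID
prefix = "logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/BUILD_ID/"
started = json.loads(bucket.get_blob(prefix + "started.json").download_as_bytes())
finished = json.loads(bucket.get_blob(prefix + "finished.json").download_as_bytes())
log_blob = bucket.get_blob(prefix + "build-log.txt")

print(started["timestamp"], finished["timestamp"], finished["result"], log_blob.size)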

Beyond that common information, each job type has its own set of job-specific artifacts available under the artifacts/ infix. This notebook does not explore job-specific logs or artifacts.

[1]
import json
import matplotlib.pyplot as plt
import pandas as pd
import os
import seaborn as sns
from google.cloud import storage

pd.options.display.float_format = "{:.2f}".format
sns.set()

The following class contains the logic for obtaining the logs and metadata that we are interested in:

[2]
# Getting information about all builds from the job

# TODO: add an option to limit the number of builds to work with


class Builds:

    # To fetch build metadata without downloading a copy of the log files
    # create an instance with refresh=False and call fetch_build_data

    project = "origin-ci-test"  # The Google Storage bucket
    logs_path = "../../../../data/raw/gcs/build-logs/"  # Where to save log files
    metadata_path = "../../../../data/raw/gcs/build-metadata/"  # Where to save metadata

    @staticmethod
    def log_path_for(build_id):
        return os.path.join(Builds.logs_path, f"{build_id}.txt")

    def __init__(self, job_name, refresh=False):
        self.job_name = job_name
        self.prefixes = set()
        self.latest_build_id = ""
        self.df = None
        self.refresh = refresh
        self.cache_file_name = os.path.join(
            self.metadata_path, f"{job_name}_build-logs.csv"
        )
        # Note: it's important that the job prefix ends in '/' as this is used as a
        # prefix for google storage client's list_blobs with '/' as the delimiter
        # FIXME: this prefix only works for periodics. Generalize to include pre/post-submit job types
        self.job_prefix = f"logs/{job_name}/"
        # Google storage client
        self.client = storage.Client.create_anonymous_client()
        self.bucket = self.client.bucket(self.project)

    def prow_url_for(self, build_id):
        return f"https://prow.ci.openshift.org/view/gcs/{self.project}/{self.job_prefix}{build_id}"

    def build_log_url_for(self, build_id):
        """Returns a public URL to download a build's build-log.txt"""
        blob = self.bucket.get_blob(f"{self.job_prefix}{build_id}/build-log.txt")
        return blob.public_url

    # from https://github.com/googleapis/google-cloud-python/issues/920
    def _fetch_build_prefixes(self):
        iterator = self.client.list_blobs(
            self.bucket, prefix=self.job_prefix, delimiter="/"
        )
        for page in iterator.pages:
            self.prefixes.update(page.prefixes)

    def fetch_build_data(self):
        """Downloads build log information from google storage.
        If "refresh" is set, it dowloads and saves a copy of the log file itself.
        Otherwise, it just gets the metadata.
        This method does not save metadata to a local file"""
        self.df = pd.DataFrame(columns=["result", "size", "start", "end"])

        print("Fetching information about builds of", self.job_name)
        if not self.prefixes:
            self._fetch_build_prefixes()

        num_builds = len(self.prefixes)
        for i, prefix in enumerate(self.prefixes, start=1):
            build_id = prefix.split("/")[-2]
            print(f"({i}/{num_builds}) {build_id}", end="\r")

            blob = self.bucket.get_blob(f"{prefix}finished.json")
            if not blob:
                # Build has not finished yet
                continue
            finished = json.loads(blob.download_as_bytes())

            blob = self.bucket.get_blob(f"{prefix}started.json")
            if not blob:
                # Build has not started yet
                continue
            started = json.loads(blob.download_as_bytes())

            blob = self.bucket.get_blob(f"{prefix}build-log.txt")
            if blob:
                length = blob.size
                path = Builds.log_path_for(build_id)
                if self.refresh and not os.path.exists(path):
                    # Save a copy of the log itself, unless we already have it
                    blob.download_to_filename(path)
            else:
                length = 0

            row = {
                "result": finished["result"],
                "size": length,
                "start": started["timestamp"],
                "end": finished["timestamp"],
            }
            # Append this build's row (pd.concat; DataFrame.append was removed in pandas 2.0)
            self.df = pd.concat([self.df, pd.DataFrame(row, index=[build_id])])
            if build_id > self.latest_build_id:
                self.latest_build_id = build_id
        print("Latest build is", self.latest_build_id)

    def save_metadata(self, overwrite=False):
        """Saves a copy of the metadata to a local CSV file"""
        if os.path.exists(self.cache_file_name) and not overwrite:
            print(f"{self.cache_file_name} already exists")
            print("Use overwrite=True to overwrite")
            return
        if not self.have_metadata():
            # We don't have anything to save yet, fetch it
            self.fetch_build_data()
        print(f"Saving build data to {self.cache_file_name}")
        self.df.to_csv(self.cache_file_name)

    def load_metadata(self):
        """Loads the metadata DataFrame from a local CSV file"""
        print(f"Loading build data from {self.cache_file_name}")
        self.df = pd.read_csv(self.cache_file_name, index_col=0)
        self.latest_build_id = self.df.index.max()

    def have_metadata(self):
        return self.df is not None and not self.df.empty

    def get_metadata(self):
        """Returns a DataFrame with build metadata,
        loading it from a previous save or downloading it
        from google cloud"""
        if not self.have_metadata():
            if self.refresh:
                self.fetch_build_data()
            else:
                self.load_metadata()
        return self.df

Get data for a job

Let's pick a specific job to work with.

NOTE: pick only a periodic job for now. Pre/post-submit jobs are not much different, but the code so far assumes a /logs prefix, while pre/post-submit jobs use e.g. /pr-logs.

[3]
job_name = "periodic-ci-openshift-release-master-ci-4.8-e2e-gcp"

There are 3 options to load the information about job builds:

  1. Download everything (metadata and log files) from google cloud storage.
  2. Download the metadata but do not download the logs themselves.
  3. Load metadata from a local CSV file that has been created on a previous download.

A local copy of the metadata can be saved by calling save_metadata. NOTE: an existing metadata copy will not be overwritten by default; call with overwrite=True to replace the existing cache.

[4]
# Option 1: Refresh (re-download) all data, including log files:
# builds = Builds(job_name, refresh=True)
# builds.save_metadata(overwrite=True)

# Option 2: Get fresh build metadata without downloading a copy of the logs
# builds = Builds(job_name, refresh=False)
# builds.fetch_build_data()
# builds.save_metadata(overwrite=True)

# Option 3: Load [cached] build information
builds = Builds(job_name)

Let's get all the metadata (result, duration, build-log length) of all builds of that job.

Note that if we set refresh=True above this will also download a copy of each build log locally.

[5]
df = builds.get_metadata()

# Remove builds that simply errored
build_errors = df[df["result"] == "error"].index
df.drop(build_errors, inplace=True)

# From timestamps to job duration
df["duration"] = df["end"] - df["start"]
del df["start"]
del df["end"]

# A boolean version of the result
df["success"] = df["result"] == "SUCCESS"
del df["result"]

df
Loading build data from ../../../../data/raw/gcs/build-metadata/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp_build-logs.csv
size duration success
1368338379971760128 5846974 4432 False
1372646746210963456 9785 5111 True
1375709011390763008 9962 4435 True
1380281977160077312 3841 9649 True
1385025333723402240 3868 5436 True
... ... ... ...
1381826019446493184 3837 4778 True
1371068681810874368 9782 4804 True
1377999575511470080 12329 4855 True
1369750447904002048 6467331 4742 False
1382368829539094528 3840 4903 True

837 rows × 3 columns

Overview of the metadata

[6]
print(f"success: {df.success.sum()} PASS and {(~df.success).sum()} FAIL")
df.describe()
success: 479 PASS and 358 FAIL
size duration
count 837.00 837.00
mean 3223683.72 5214.82
std 5112869.50 1558.29
min 92.00 4.00
25% 3872.00 4460.00
50% 9883.00 4785.00
75% 7329272.00 5436.00
max 37896258.00 15294.00

And a graphical overview of the distributions:

[7]
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
axs[0].hist(df["size"], log=True, bins=30)
axs[0].set_xlabel("build log length (bytes)")
axs[1].hist(df["duration"], bins=30)
axs[1].set_xlabel("job duration (seconds)")
results = [df.success.sum(), (~df.success).sum()]
axs[2].pie(results, autopct="%1.1f%%", labels=["Pass", "Fail"], colors=["green", "red"])
axs[2].set_xlabel("job result")
plt.show()

We can see that build log sizes range from very short (a few bytes) to quite long (tens of megabytes). Looking at the median (50%), most logs stay on the relatively small side, but the ones that do grow get significantly bigger. Note that the histogram for log length uses a logarithmic scale to better accommodate this wide range.

We will find out why by looking at the relationship between the features:

[8]
plt.plot("size", "duration", "go", data=df[df.success], alpha=0.1, label="Pass")
plt.plot("size", "duration", "ro", data=df[~df.success], alpha=0.1, label="Fail")
plt.xscale("log")
plt.xlabel("build-log size (bytes)")
plt.ylabel("duration (seconds)")
plt.legend()
plt.show()

Here we can notice a few interesting things:

  • There is a strong correlation between a build's build-log size and its result.
  • Successful job runs have a very consistent build log length.
  • A build's duration also has a relationship with the results: a duration under a certain threshold indicates a failure before tests could start, e.g. install or infrastructure issues.

All in all, it seems quite feasible to classify build results based on their log size and duration; the sketch below puts this to a quick test.
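
As a quick check (a minimal sketch, not part of the original analysis; it assumes scikit-learn is available, which this notebook does not otherwise use):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Features: just the two metadata columns explored above
X = df[["size", "duration"]]
y = df["success"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.2f}")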

Log samples

Now let's take a look at the contents of the build logs.

Latest build's head

This is the beginning of the latest build's build-log.txt:

[9]
print("Build URL:", builds.prow_url_for(builds.latest_build_id), "\n")

path = Builds.log_path_for(builds.latest_build_id)
with open(path) as f:
    lines = f.readlines(2000)

for line in lines:
    print(line, end="")
Build URL: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1387498929158885376

INFO[2021-04-28T20:08:36Z] ci-operator version v20210428-04f6984
INFO[2021-04-28T20:08:36Z] Loading configuration from https://config.ci.openshift.org for openshift/release@master [ci-4.8]
INFO[2021-04-28T20:08:37Z] Resolved SHA missing for master in https://github.com/openshift/release (will prevent caching)
INFO[2021-04-28T20:08:37Z] Resolved release initial to registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-04-27-075557
INFO[2021-04-28T20:08:37Z] Using explicitly provided pull-spec for release latest (registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-04-28-200224)
INFO[2021-04-28T20:08:37Z] Using namespace https://console.build01.ci.openshift.org/k8s/cluster/projects/ci-op-ptfyqcy5
INFO[2021-04-28T20:08:37Z] Running [input:origin-centos-8], [input:ocp-4.5-upi-installer], [release:latest], [images], e2e-gcp
INFO[2021-04-28T20:08:38Z] Tagging ocp/4.5:upi-installer into pipeline:ocp-4.5-upi-installer.
INFO[2021-04-28T20:08:38Z] Tagging origin/centos:8 into pipeline:origin-centos-8.
INFO[2021-04-28T20:08:38Z] Importing release image latest.
INFO[2021-04-28T20:11:42Z] Imported release 4.8.0-0.ci-2021-04-28-200224 created at 2021-04-28 20:04:24 +0000 UTC with 140 images to tag release:latest
INFO[2021-04-28T20:11:42Z] Acquiring leases for test e2e-gcp
INFO[2021-04-28T20:11:42Z] Running multi-stage test e2e-gcp
INFO[2021-04-28T20:11:42Z] Running step e2e-gcp-ipi-conf.
INFO[2021-04-28T20:11:52Z] Step e2e-gcp-ipi-conf succeeded after 10s.
INFO[2021-04-28T20:11:52Z] Running step e2e-gcp-ipi-conf-gcp.
INFO[2021-04-28T20:12:02Z] Step e2e-gcp-ipi-conf-gcp succeeded after 10s.
INFO[2021-04-28T20:12:02Z] Running step e2e-gcp-ipi-install-monitoringpvc.
INFO[2021-04-28T20:12:12Z] Step e2e-gcp-ipi-install-monitoringpvc succeeded after 10s.

NOTE: pretty output with a colored log level is a recent addition as of this writing (PR 1836, merged 2021-04-05). This change altered the formatting of all build logs (from the change onwards, of course; earlier builds' logs have a different format). May this serve as an example of the type of diversity and unexpected output we need to handle.
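
To make that concrete (an illustrative sketch; the two regexes are inferred from the samples shown in this notebook and are not a complete grammar of ci-operator output), entries in the old and new formats have to be matched differently:

import re

# Old ci-operator format, e.g.:
#   time="2021-04-03T20:16:31Z" level=info msg="Importing release image latest."
OLD_FORMAT = re.compile(r'time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>.*)"')
# New ci-operator format (after the PR mentioned above), e.g.:
#   INFO[2021-04-28T20:08:38Z] Importing release image latest.
NEW_FORMAT = re.compile(r"(?P<level>[A-Z]{4})\[(?P<time>[^\]]+)\] (?P<msg>.*)")

def parse_entry(line):
    """Return time/level/msg for lines in either format, None otherwise."""
    for pattern in (OLD_FORMAT, NEW_FORMAT):
        match = pattern.match(line)
        if match:
            return match.groupdict()
    return None  # free-form line (command output, etc.)

parse_entry returns the timestamp, level, and message for lines in either format, and None for the free-form lines in between.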

Full log of a successful job

When we looked at the metadata, we saw that all the successful builds have a similar length. Let's look at one of them:

[10]
sample_pass = df[df.success].sample(random_state=1).index.item()

print("Full log of a successful build: ", sample_pass)
print("Build URL:", builds.prow_url_for(sample_pass))
print("Raw log URL:", builds.build_log_url_for(sample_pass), "\n")

path = Builds.log_path_for(sample_pass)
with open(path) as f:
    lines = f.readlines()

for line in lines:
    print(line, end="")
Full log of a successful build: 1378441280396201984
Build URL: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1378441280396201984
Raw log URL: https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1378441280396201984/build-log.txt

time="2021-04-03T20:16:31Z" level=info msg="ci-operator version v20210402-7d9fcd2"
time="2021-04-03T20:16:31Z" level=info msg="Loading configuration from https://config.ci.openshift.org for openshift/release@master [ci-4.8]"
time="2021-04-03T20:16:31Z" level=info msg="Resolved source https://github.com/openshift/release to master@"
time="2021-04-03T20:16:31Z" level=info msg="Resolved release initial to registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-04-02-173758"
time="2021-04-03T20:16:31Z" level=info msg="Using explicitly provided pull-spec for release latest (registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-04-03-201542)"
time="2021-04-03T20:16:32Z" level=info msg="Using namespace https://console.build02.ci.openshift.org/k8s/cluster/projects/ci-op-8ps2vgf0"
time="2021-04-03T20:16:32Z" level=info msg="Running [input:origin-centos-8], [input:ocp-4.5-upi-installer], [release:latest], [images], e2e-gcp"
time="2021-04-03T20:16:33Z" level=info msg="Tagging origin/centos:8 into pipeline:origin-centos-8."
time="2021-04-03T20:16:33Z" level=info msg="Tagging ocp/4.5:upi-installer into pipeline:ocp-4.5-upi-installer."
time="2021-04-03T20:16:33Z" level=info msg="Importing release image latest."
time="2021-04-03T20:18:03Z" level=info msg="Imported release 4.8.0-0.ci-2021-04-03-201542 created at 2021-04-03 20:15:59 +0000 UTC with 137 images to tag release:latest"
time="2021-04-03T20:18:03Z" level=info msg="Acquiring leases for test e2e-gcp"
time="2021-04-03T20:18:03Z" level=info msg="Running multi-stage test e2e-gcp"
time="2021-04-03T20:18:03Z" level=info msg="Running step e2e-***-ipi-conf."
time="2021-04-03T20:18:13Z" level=info msg="Step e2e-***-ipi-conf succeeded after 10.043593091s."
time="2021-04-03T20:18:13Z" level=info msg="Running step e2e-***-ipi-conf-***."
time="2021-04-03T20:18:23Z" level=info msg="Step e2e-***-ipi-conf-*** succeeded after 10.06050712s."
time="2021-04-03T20:18:23Z" level=info msg="Running step e2e-***-ipi-install-monitoringpvc."
time="2021-04-03T20:18:33Z" level=info msg="Step e2e-***-ipi-install-monitoringpvc succeeded after 10.092816083s."
time="2021-04-03T20:18:33Z" level=info msg="Running step e2e-***-ipi-install-rbac."
time="2021-04-03T20:18:43Z" level=info msg="Step e2e-***-ipi-install-rbac succeeded after 10.064313278s."
time="2021-04-03T20:18:43Z" level=info msg="Running step e2e-***-ipi-install-install."
time="2021-04-03T20:50:53Z" level=info msg="Step e2e-***-ipi-install-install succeeded after 32m10.048191864s."
time="2021-04-03T20:50:53Z" level=info msg="Running step e2e-***-openshift-e2e-test."
time="2021-04-03T21:18:23Z" level=info msg="Step e2e-***-openshift-e2e-test succeeded after 27m30.068558431s."
time="2021-04-03T21:18:23Z" level=info msg="Running step e2e-***-gather-core-dump."
time="2021-04-03T21:18:43Z" level=info msg="Step e2e-***-gather-core-dump succeeded after 20.08649363s."
time="2021-04-03T21:18:43Z" level=info msg="Running step e2e-***-gather-***-console."
time="2021-04-03T21:19:34Z" level=info msg="Step e2e-***-gather-***-console succeeded after 50.054776703s."
time="2021-04-03T21:19:34Z" level=info msg="Running step e2e-***-gather-must-gather."
time="2021-04-03T21:23:04Z" level=info msg="Step e2e-***-gather-must-gather succeeded after 3m30.054931144s."
time="2021-04-03T21:23:04Z" level=info msg="Running step e2e-***-gather-extra."
time="2021-04-03T21:25:04Z" level=info msg="Step e2e-***-gather-extra succeeded after 2m0.055763349s."
time="2021-04-03T21:25:04Z" level=info msg="Running step e2e-***-gather-audit-logs."
time="2021-04-03T21:26:04Z" level=info msg="Step e2e-***-gather-audit-logs succeeded after 1m0.054968828s."
time="2021-04-03T21:26:04Z" level=info msg="Running step e2e-***-ipi-deprovision-deprovision."
time="2021-04-03T21:30:44Z" level=info msg="Step e2e-***-ipi-deprovision-deprovision succeeded after 4m40.050907857s."
time="2021-04-03T21:30:44Z" level=info msg="Releasing leases for test e2e-***"
time="2021-04-03T21:30:44Z" level=info msg="Ran for 1h14m12s"
time="2021-04-03T21:30:44Z" level=info msg="Reporting job state 'succeeded'"

Successful jobs tend to have quite similar contents, essentially enumerating the steps that are being followed and reporting their success.

While they are all similar, note the differences in output format, and also format differences within the log file itself.

Failed job output

When there are failures, the log output of the step that failed is appended to the build log. This is why we see such significant differences in log lengths, and why logs from failed jobs tend to be longer. Alternatively, if the problem occurs during the preparatory steps, we get shorter logs, as we saw in the metadata overview.

Let's look at some failed builds' logs. Because these can be huge, we only look at a few lines around the middle of each log file, which is enough to notice that they contain different types of logs than a successful build, and that they also differ from each other.

NOTE: a representative sample was handpicked from builds I had downloaded. To reproduce this, just uncomment the line below that sets failed_builds to be a random sample of the data you currently have; re-running the cell a few times should show the diversity of these logs.

[11]
# A list of hand-picked representative build IDs from failed job runs:
failed_builds = [  # Some custom build IDs representing log diversity
    1372555706749685760,  # infra error, short log, long duration, old format
    1380185997530107904,  # same type of error, new formatting
    1375113030177656832,  # mixed output structure
    1383399218122592256,  # new formatting, duplicate and inconsistent timestamps
    1375882831099596800,  # output from a must-gather
]

# No hand pick version: Get a sample of current failed builds
# IMPORTANT NOTE: ------------------------------------------------------------------
# IMPORTANT NOTE: uncomment next line to get a sample from currently available data:
# IMPORTANT NOTE: ------------------------------------------------------------------
# failed_builds = df[~df.success].sample(3).index

for build_id in failed_builds:
    print(f"\n>> BUILD {build_id}")
    print(">> BUILD URL:", builds.prow_url_for(build_id))
    print(">> RAW LOG URL:", builds.build_log_url_for(build_id), "\n")
    path = Builds.log_path_for(build_id)

    with open(path) as f:
        # We will look around the middle of the file
        length = df.loc[build_id]["size"].item()
        f.seek(length / 2)
        lines = f.readlines(2500)

    for line in lines[1:]:
        print(line, end="")
>> BUILD 1372555706749685760
>> BUILD URL: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1372555706749685760
>> RAW LOG URL: https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1372555706749685760/build-log.txt

2021/03/18 14:29:22 Setting up pipeline imagestream for the test
2021/03/18 14:29:22 Updated secret ci-pull-credentials
2021/03/18 14:29:22 Created secret e2e-gcp-cluster-profile
2021/03/18 14:29:22 Created PDB for pods with openshift.io/build.name label
2021/03/18 14:29:22 Created PDB for pods with created-by-ci label
2021/03/18 14:29:22 Tagging origin/centos:8 into pipeline:origin-centos-8
2021/03/18 14:29:22 Tagging ocp/4.5:upi-installer into pipeline:ocp-4.5-upi-installer
2021/03/18 14:29:22 Importing release image latest
2021/03/18 14:29:23 Executing pod "release-images-latest-cli" running image "release:latest"
2021/03/18 14:29:33 Executing pod "release-images-latest" running image "stable:cli"
2021/03/18 14:30:50 Imported release 4.8.0-0.ci-2021-03-18-142508 created at 2021-03-18 14:25:55 +0000 UTC with 132 images to tag release:latest
2021/03/18 14:30:50 Acquiring leases for "e2e-gcp"
2021/03/18 14:30:50 Acquiring 1 lease(s) for "gcp-quota-slice"
2021/03/18 16:30:50 error: Failed to acquire resource, current capacity: 0 free, 70 leased
2021/03/18 16:30:50 No custom metadata found and prow metadata already exists. Not updating the metadata.
2021/03/18 16:30:50 Ran for 2h1m28s
error: some steps failed:
* could not run steps: step e2e-gcp failed: failed to acquire lease: resources not found
time="2021-03-18T16:30:50Z" level=info msg="Reporting job state 'failed' with reason 'executing_graph:step_failed:utilizing_lease'"

>> BUILD 1380185997530107904
>> BUILD URL: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1380185997530107904
>> RAW LOG URL: https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1380185997530107904/build-log.txt

INFO[2021-04-08T15:49:27Z] Tagging ocp/4.5:upi-installer into pipeline:ocp-4.5-upi-installer.
INFO[2021-04-08T15:49:27Z] Importing release image latest.
INFO[2021-04-08T15:52:09Z] Imported release 4.8.0-0.ci-2021-04-08-154429 created at 2021-04-08 15:48:46 +0000 UTC with 138 images to tag release:latest
INFO[2021-04-08T15:52:09Z] Acquiring leases for test e2e-gcp
ERRO[2021-04-08T17:52:09Z] error: Failed to acquire resource, current capacity: 0 free, 70 leased
INFO[2021-04-08T17:52:09Z] Ran for 2h2m43s
ERRO[2021-04-08T17:52:09Z] Some steps failed:
ERRO[2021-04-08T17:52:09Z] * could not run steps: step e2e-gcp failed: failed to acquire lease: resources not found

>> BUILD 1375113030177656832
>> BUILD URL: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1375113030177656832
>> RAW LOG URL: https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1375113030177656832/build-log.txt

level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Working towards 4.8.0-0.ci-2021-03-25-155021: 687 of 689 done (99% complete), waiting on cloud-controller-manager
Setup phase finished, prepare env for next steps
Copying log bundle...
Removing REDACTED info from log...
Copying required artifacts to shared dir
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"prow/entrypoint/run.go:80","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2021-03-25T18:31:04Z"}
error: failed to execute wrapped command: exit status 1
2021/03/25 18:31:09 Container test in pod e2e-gcp-ipi-install-install failed, exit code 1, reason Error
2021/03/25 18:31:09 Executing pod "e2e-gcp-gather-core-dump" running image "stable:cli"
2021/03/25 18:31:20 Container place-entrypoint in pod e2e-gcp-gather-core-dump completed successfully
2021/03/25 18:31:20 Container cp-entrypoint-wrapper in pod e2e-gcp-gather-core-dump completed successfully
2021/03/25 18:31:50 Container sidecar in pod e2e-gcp-gather-core-dump completed successfully
2021/03/25 18:31:50 Container test in pod e2e-gcp-gather-core-dump completed successfully
2021/03/25 18:31:50 Pod e2e-gcp-gather-core-dump succeeded after 31s
2021/03/25 18:31:50 Executing pod "e2e-gcp-gather-gcp-console" running image "pipeline:ocp-4.5-upi-installer"
2021/03/25 18:32:00 Container place-entrypoint in pod e2e-gcp-gather-gcp-console completed successfully
2021/03/25 18:32:00 Container cp-entrypoint-wrapper in pod e2e-gcp-gather-gcp-console completed successfully
2021/03/25 18:32:40 Container test in pod e2e-gcp-gather-gcp-console completed successfully
2021/03/25 18:32:50 Container sidecar in pod e2e-gcp-gather-gcp-console completed successfully
2021/03/25 18:32:50 Pod e2e-gcp-gather-gcp-console succeeded after 49s
2021/03/25 18:32:50 Executing pod "e2e-gcp-gather-must-gather" running image "stable:cli"
2021/03/25 18:33:00 Container place-entrypoint in pod e2e-gcp-gather-must-gather completed successfully
2021/03/25 18:33:00 Container cp-entrypoint-wrapper in pod e2e-gcp-gather-must-gather completed successfully
2021/03/25 18:36:00 Container test in pod e2e-gcp-gather-must-gather completed successfully

>> BUILD 1383399218122592256
>> BUILD URL: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1383399218122592256
>> RAW LOG URL: https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1383399218122592256/build-log.txt

INFO[2021-04-17T13:43:59Z] Apr 17 13:19:49.383 I ns/e2e-nettest-6790 pod/netserver-1 node/ reason/Created
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:49.410 I ns/e2e-nettest-6790 pod/netserver-0 node/ci-op-tf25ybz0-f23e1-vhknz-worker-b-sflfq reason/Scheduled
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:49.442 E ns/e2e-statefulset-4735 pod/test-pod node/ci-op-tf25ybz0-f23e1-vhknz-worker-b-sflfq container/webserver container exited with code 1 (Error):
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:49.456 I ns/e2e-nettest-6790 pod/netserver-2 node/ reason/Created
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:49.473 I ns/e2e-nettest-6790 pod/netserver-1 node/ci-op-tf25ybz0-f23e1-vhknz-worker-c-85q2g reason/Scheduled
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:49.517 E ns/e2e-statefulset-4735 pod/ss-0 node/ci-op-tf25ybz0-f23e1-vhknz-worker-b-sflfq reason/Failed (NodePorts): Pod Predicate NodePorts failed
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:49.610 I ns/e2e-nettest-6790 pod/netserver-2 node/ci-op-tf25ybz0-f23e1-vhknz-worker-d-ksn4v reason/Scheduled
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:49.632 W ns/e2e-statefulset-4735 pod/ss-0 node/ci-op-tf25ybz0-f23e1-vhknz-worker-b-sflfq reason/GracefulDelete in 0s
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:49.672 W ns/e2e-statefulset-4735 pod/ss-0 node/ci-op-tf25ybz0-f23e1-vhknz-worker-b-sflfq reason/Deleted
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:49.723 I ns/e2e-statefulset-4735 pod/ss-0 node/ci-op-tf25ybz0-f23e1-vhknz-worker-b-sflfq reason/Created
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:49.737 I ns/e2e-volumemode-5842-6017 pod/csi-hostpath-snapshotter-0 node/ci-op-tf25ybz0-f23e1-vhknz-worker-b-sflfq container/csi-snapshotter reason/Ready
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:49.982 E ns/e2e-statefulset-4735 pod/ss-0 node/ci-op-tf25ybz0-f23e1-vhknz-worker-b-sflfq reason/Failed (NodePorts): Pod Predicate NodePorts failed
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:50.000 I ns/e2e-network-policy-8642 pod/client-b-hd6c9 node/ci-op-tf25ybz0-f23e1-vhknz-worker-d-ksn4v container/client reason/Created
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:50.000 I ns/e2e-network-policy-8642 pod/client-b-hd6c9 node/ci-op-tf25ybz0-f23e1-vhknz-worker-d-ksn4v container/client reason/Started
INFO[2021-04-17T13:43:59Z] Apr 17 13:19:50.000 I ns/e2e-init-container-8983 pod/pod-init-a9d79f8a-1345-470d-809a-7187e250b285 node/ci-op-tf25ybz0-f23e1-vhknz-worker-c-85q2g container/init1 reason/Created

>> BUILD 1375882831099596800
>> BUILD URL: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1375882831099596800
>> RAW LOG URL: https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1375882831099596800/build-log.txt

When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
error getting cluster version: Get "https://api.ci-op-5zgvkrl1-f23e1.origin-ci-int-gce.dev.openshift.com:6443/apis/config.openshift.io/v1/clusterversions/version": dial tcp 35.231.173.152:6443: i/o timeout
ClusterID:
ClusterVersion: Installing "" for <unknown>: <unknown>
error getting cluster operators: Get "https://api.ci-op-5zgvkrl1-f23e1.origin-ci-int-gce.dev.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 35.231.173.152:6443: i/o timeout
ClusterOperators:
    clusteroperators are missing
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
error getting cluster version: Get "https://api.ci-op-5zgvkrl1-f23e1.origin-ci-int-gce.dev.openshift.com:6443/apis/config.openshift.io/v1/clusterversions/version": dial tcp 35.231.173.152:6443: i/o timeout
ClusterID:
ClusterVersion: Installing "" for <unknown>: <unknown>
error getting cluster operators: Get "https://api.ci-op-5zgvkrl1-f23e1.origin-ci-int-gce.dev.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 35.231.173.152:6443: i/o timeout
ClusterOperators:
    clusteroperators are missing
Gathering data for ns/openshift-cloud-credential-operator...
Gathering data for ns/openshift-config...
Gathering data for ns/openshift-config-managed...
Gathering data for ns/openshift-etcd-operator...
Gathering data for ns/openshift-etcd...
Gathering data for ns/openshift-kube-apiserver-operator...
Gathering data for ns/openshift-kube-apiserver...
Gathering data for ns/openshift-kube-controller-manager...
Gathering data for ns/openshift-kube-controller-manager-operator...
Gathering data for ns/openshift-kube-scheduler-operator...
Gathering data for ns/openshift-kube-scheduler...
Gathering data for ns/openshift-kube-storage-version-migrator-operator...
Gathering data for ns/openshift-machine-config-operator...
Wrote inspect data to /logs/artifacts/audit-logs.
error running backup collection: errors ocurred while gathering data: [skipping gathering securitycontextconstraints.security.openshift.io due to error: the server doesn't have a resource type "securitycontextconstraints", skipping gathering podnetworkconnectivitychecks.controlplane.operator.openshift.io due to error: the server doesn't have a resource type "podnetworkconnectivitychecks", skipping gathering namespaces/openshift-kube-storage-version-migrator due to error: namespaces "openshift-kube-storage-version-migrator" not found, skipping gathering controllerconfigs.machineconfiguration.openshift.io due to error: the server doesn't have a resource type "controllerconfigs"]Unable to connect to the server: dial tcp 35.231.173.152:6443: i/o timeout

We can see that the build logs can contain quite diverse content, including multi-line command output and entries that are partially structured (timestamps, log levels, components) or completely unstructured, all mixed together in the same log file.

Traditional approaches to automated log parsing would probably not work too well with this type of log data.
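
As a rough illustration of why (a sketch with deliberately simplistic masking rules, reusing the Builds class and failed_builds list from above), a naive template extraction that masks the obvious parameters still leaves a large number of distinct "templates" in these logs:

import re

def naive_template(line):
    """Mask the obvious variable parts of a log line."""
    line = re.sub(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}Z?", "<TS>", line)  # ISO timestamps
    line = re.sub(r"\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}", "<TS>", line)  # e.g. 2021/03/18 14:29:22
    line = re.sub(r"\b[0-9a-f]{8,}\b", "<HEX>", line)  # long hex-ish identifiers
    line = re.sub(r"\d+", "<N>", line)  # any remaining numbers
    return line.strip()

with open(Builds.log_path_for(failed_builds[0])) as f:
    templates = {naive_template(line) for line in f}
print(f"{len(templates)} distinct line templates after masking")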

Conclusions

  • Failed jobs can contain very diverse types of logs. Different failure causes often lead to different types of content in the logs.
  • Successful jobs have significantly more consistent log output.
  • Even for successful jobs, we can't assume a consistent structure.
  • It should be relatively easy to classify builds from their duration and build-log length, but can we do the same based on the content of the logs?

Next steps

  • Look at specific steps instead of the overall job build log. The expectation would be that by looking at the same step we would get more consistency across job runs, both successful and failed.
  • More detailed exploration of the log contents by applying different techniques to log files themselves.