
SMART Metrics Forecasting

In previous notebooks in this project, we explored how machine learning models can be trained to predict whether a hard drive will fail in a given future time interval. The Ceph team believes that, in addition to the predictions provided by these models, users would also find it helpful to have forecasts of the values of specific SMART metrics coming from their hard drives.

In this notebook, we explore how time series forecasting models can be used to predict future values of SMART metrics.

[2]
# imports
import warnings
from IPython.display import display

import numpy as np
import pandas as pd

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

import seaborn as sns

from matplotlib import pyplot as plt

from fbprophet import Prophet

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

from sklearn.linear_model import LinearRegression, BayesianRidge, Lasso

sns.set()
warnings.filterwarnings("ignore")
[3]
# dask tasks progress bar
pbar = ProgressBar()
pbar.register()

Read Data

In this section we'll read in the data used for training the models. For this notebook, we'll use the open source Backblaze hard drive dataset. Since the dataset is quite large, it might not fit in memory on all devices, so we'll use dask for lazy and parallel processing. Note that we are only using one quarter's worth of data for this POC, but this can be increased for production models.

[4]
# read df and keep seagate data
df = dd.read_parquet(
    path="data_Q3_2020_parquet",
)
df = df[(df["model"].str.startswith("S")) | (df["model"].str.startswith("ZA"))]
[5]
# failed vs working serials
failed_serials = df[df["failure"] == 1]["serial_number"].unique().compute().values
all_serials = df["serial_number"].unique().compute().values
working_serials = np.setdiff1d(all_serials, failed_serials)

Clean Data

In this section we will clean up the data. We'll begin with a small set of SMART metrics to use for training (smart_stats_to_keep). Then we'll determine which columns have the most missing data, and whether this missing data is specific to certain drive models. Based on this information we'll refine the SMART metrics kept in the dataset (i.e. update smart_stats_to_keep and re-run the following cells). Finally, from the results of this analysis, we'll determine which serial numbers (i.e. hard drives) should be kept so that we have a clean dataset.

[6]
# set which columns are metadata and which ones are smart attributes
meta_cols = ["date", "serial_number", "model", "capacity_bytes", "failure"]

# set which columns to use in analysis
# NOTE: this was determined based on Backblaze research, an IBM paper, and the SMART wiki
smart_stats_to_keep = [5, 187, 188, 197, 198]
smart_cols = [f"smart_{i}_raw" for i in smart_stats_to_keep]
smart_cols += [f"smart_{i}_normalized" for i in smart_stats_to_keep]

Failed Drives Data

[7]
# how much data is missing, device-wise and feature-wise
serialwise_featurewise_pct_nans = (
    df[df["serial_number"].isin(failed_serials)][["serial_number"] + smart_cols]
    .groupby("serial_number")
    .apply(lambda x: x.isna().mean())
    .compute()
)

mean_serialwise_pct_nans = serialwise_featurewise_pct_nans.mean(axis=1).sort_values(
    ascending=False
)
mean_serialwise_pct_nans.head(15)
serial_number
7LZ022BP          0.909091
7LZ045HN          0.909091
S2ZYJ9GGB04771    0.363636
S2ZYJ9GGB04761    0.363636
S2ZYJ9FFB18437    0.363636
S2ZYJ9CF504020    0.363636
S2ZYJ9FG404851    0.363636
S2ZYJ9KG303913    0.363636
ZA180XLA          0.014430
ZHZ3SPDW          0.000000
ZA18GX05          0.000000
ZCH080S9          0.000000
ZA12MET8          0.000000
ZA180Q9Q          0.000000
ZJV1BTSY          0.000000
dtype: float64
[8]
# peek at a few samples to better understand how the nans occur

# drive with 90% data missing
print("7LZ022BP")
print(
    df[df["serial_number"] == "7LZ022BP"][smart_cols]
    .isna()
    .mean()
    .compute()
    .sort_values(ascending=False)
)

# drive with 1% data missing
print("ZA180XLA")
print(
    df[df["serial_number"] == "ZA180XLA"][smart_cols]
    .isna()
    .mean()
    .compute()
    .sort_values(ascending=False)
)

# drive with ~36% data missing
print("S2ZYJ9GGB04771")
print(
    df[df["serial_number"] == "S2ZYJ9GGB04771"][smart_cols]
    .isna()
    .mean()
    .compute()
    .sort_values(ascending=False)
)
7LZ022BP
smart_198_normalized    1.0
smart_197_normalized    1.0
smart_188_normalized    1.0
smart_187_normalized    1.0
smart_5_normalized      1.0
smart_198_raw           1.0
smart_197_raw           1.0
smart_188_raw           1.0
smart_187_raw           1.0
smart_5_raw             1.0
dtype: float64
ZA180XLA
smart_198_normalized    0.015873
smart_197_normalized    0.015873
smart_188_normalized    0.015873
smart_187_normalized    0.015873
smart_5_normalized      0.015873
smart_198_raw           0.015873
smart_197_raw           0.015873
smart_188_raw           0.015873
smart_187_raw           0.015873
smart_5_raw             0.015873
dtype: float64
S2ZYJ9GGB04771
smart_188_normalized    1.0
smart_187_normalized    1.0
smart_188_raw           1.0
smart_187_raw           1.0
smart_198_normalized    0.0
smart_197_normalized    0.0
smart_5_normalized      0.0
smart_198_raw           0.0
smart_197_raw           0.0
smart_5_raw             0.0
dtype: float64

NOTE: From the above samples we can see that nans occur either when only a few rows are missing, or when entire columns are missing.
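Given these two patterns, a minimal sketch of how they could be handled differently is shown below (a hypothetical helper for illustration only; this notebook instead simply keeps the drives with no missing data at all): drives missing entire columns are discarded, while occasional missing rows are forward-filled.

# hypothetical per-drive cleanup helper illustrating the two nan patterns
def clean_drive_ts(drive_ts, col_missing_thresh=0.5):
    # a column missing for most of the drive's history suggests the metric
    # is never reported by this drive model - discard the whole drive
    if (drive_ts[smart_cols].isna().mean() > col_missing_thresh).any():
        return None

    # otherwise only a handful of rows are missing - forward-fill them
    return drive_ts.ffill()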

[9]
# keep only devices that have no missing data
failed_ts_df = df[
    df["serial_number"].isin(
        mean_serialwise_pct_nans[mean_serialwise_pct_nans == 0].index
    )
][meta_cols + smart_cols].compute()

Working Drives Data

[10]
# how much data is missing, device-wise and feature-wise
serialwise_featurewise_pct_nans = (
    df[df["serial_number"].isin(working_serials)][["serial_number"] + smart_cols]
    .groupby("serial_number")
    .apply(lambda x: x.isna().mean())
    .compute()
)

mean_serialwise_pct_nans = serialwise_featurewise_pct_nans.mean(axis=1).sort_values(
    ascending=False
)
mean_serialwise_pct_nans.head(15)
serial_number
7LZ01GHG    0.909091
7LZ01NDE    0.909091
7LZ032RA    0.909091
7LZ029KM    0.909091
NB120H9S    0.909091
NB1206GH    0.909091
7QT01D0F    0.909091
7QT01BPW    0.909091
7QT00H3G    0.909091
7LZ047MH    0.909091
7LZ045BC    0.909091
7LZ044SF    0.909091
7LZ036NC    0.909091
7LZ036LY    0.909091
7LZ02ZPX    0.909091
dtype: float64
[11]
# downsample working drives data - select a simple subset for now, use
# clustering in later improvement iterations
pct_working_to_keep = 0.00125
num_working_to_keep = int(pct_working_to_keep * len(working_serials))

# keep devices that have no missing data
working_ts_df = df[
    df["serial_number"].isin(
        mean_serialwise_pct_nans[mean_serialwise_pct_nans == 0].index[
            :num_working_to_keep
        ]
    )
][meta_cols + smart_cols].compute()
[12]
# amount of clean data left to work with
working_ts_df.shape, failed_ts_df.shape
((12405, 15), (11735, 15))
[13]
# how much failing drive data do we lose?
dd.compute(df[df["serial_number"].isin(failed_serials)].shape)
((12242, 131),)
[14]
# won't need dask anymore
pbar.unregister()

Results

By dropping nans, we only lost about 4% of the data for failed hard drives. Since this is not an unreasonable loss, we won't do any other data cleaning gymnastics here. The data we have now is clean and ready for building forecasting models.
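As a quick sanity check of that figure, using the row counts printed in the cells above:

# fraction of failed-drive rows lost by keeping only nan-free devices
total_failed_rows = 12242  # from the dd.compute(...) cell above
clean_failed_rows = 11735  # from failed_ts_df.shape above
print(f"{(total_failed_rows - clean_failed_rows) / total_failed_rows:.1%}")  # 4.1%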

Build Forecasting Models

In this section we will create forecasting models using the clean data from the section above. You can play around with the variables in the cell below to configure the prediction settings.

  • Y_COLS: The metric to forecast. Since SMART 5 was determined to be a valuable indicator of failure, we'll use SMART 5 as the default.
  • X_COLS: Other features available. Some time series models can also make use of other features when generating a forecast. We will use the remaining SMART metrics (187, 188, 197, and 198) as these additional features. These were determined to be useful from Google research, IBM research, Backblaze, and the SMART metrics wiki.
  • NDAYS_DATA: How many days of data are available at runtime. Since the diskprediction_local module of Ceph stores only 6 days of SMART data, our models will only have that much data available.
  • NDAYS_TO_PREDICT: How many days into the future to forecast. Since we only have a short amount of historical data, we will forecast a short period too.

Metric SMART 5

In the following cells, we will predict the future values of the SMART 5 Metric. SMART 5 specifies the Reallocated Sectors Count for the disk, which is the count of the bad sectors that have been found and remapped.

[15]
# training setup config
Y_COLS = ["smart_5_normalized"]
y_smart_stats = set([col.split("_")[1] for col in Y_COLS])

X_COLS = [c for c in smart_cols if c.split("_")[1] not in y_smart_stats]

NDAYS_DATA = 6
NDAYS_TO_PREDICT = 6
[16]
# set date column as index
working_ts_df = working_ts_df.set_index("date")
failed_ts_df = failed_ts_df.set_index("date")

Baseline

In this section we will fit baseline models for the forecasting task. We will compare the performance of the ML models against this baseline to gauge how informative the models are. The baseline model here is one that predicts that the next N days will have the same value as today. For example, if today's value is 4 and the next three observed values are 4, 6, and 10, the baseline MSE over that window is ((4-4)^2 + (6-4)^2 + (10-4)^2) / 3 ≈ 13.3.

[17]
def get_baseline_mses_windowwise(drive_ts):
    """
    Predict the next N days the value will be same as today.
    Calculate MSE at the end of each training window
    """
    mses = pd.Series(index=drive_ts.index)
    for end_idx in range(NDAYS_DATA - 1, len(drive_ts)):
        mses.iloc[end_idx] = (
            (
                drive_ts[Y_COLS[0]].iloc[end_idx + 1 : end_idx + 1 + NDAYS_TO_PREDICT]
                - drive_ts[Y_COLS[0]].iloc[end_idx]
            )
            ** 2
        ).mean()

    return mses

Failed Drives

[18]
# calculate baseline
baseline_mses = failed_ts_df.groupby("serial_number").apply(
    get_baseline_mses_windowwise
)

# df to store all results
all_mses = baseline_mses.to_frame("mse_baseline")

# average mse across days
print("Average MSE for each drive:\n\n", baseline_mses.mean(level=0), end="\n\n")
Average MSE for each drive:

serial_number
S300WFGR          NaN
S300Z4TZ     0.031410
S300Z7P1     0.000000
S301FDQW    10.322619
S301GMWQ     0.000000
              ...
ZLW0G6CE     0.000000
ZLW0G6FJ          NaN
ZLW0GK7E     0.000000
ZLW0GKE4     0.000000
ZLW0GPC5          NaN
Length: 272, dtype: float64
[19]
# visualize how mses change over time when using baseline strategy
def visualize_mses(mses, to_plot=15):
    i = 0
    for ser in mses.index.get_level_values(0).unique():
        # get mse for each day/window
        ser_mses = mses.loc[ser]

        # don't plot if all values are nan
        if not ser_mses.isna().all():

            # don't plot if the metric to predict stayed constant until failure
            if failed_ts_df[failed_ts_df["serial_number"] == ser][Y_COLS[0]].std():
                fig, ax = plt.subplots(figsize=(16, 4))
                sns.lineplot(
                    ax=ax,
                    x=ser_mses.index,
                    y=ser_mses.values,
                )
                plt.title(ser)
                plt.ylabel("MSE")
                plt.xticks(rotation=90)
                plt.show()

                # don't plot everything
                i += 1
                if i == to_plot:
                    break


visualize_mses(baseline_mses)

Results

From the above graphs, we can see that the baseline strategy works fine while the hard drive is healthy, i.e. shows no symptoms of failure. However, as the drive approaches end of life, the error in the baseline predictions rises significantly. So although it might seem like a good strategy, it actually provides very little information when it is needed the most.

Working Drives

[20]
# calculate baseline for healthy drives
baseline_mses_working = working_ts_df.groupby("serial_number").apply(
    get_baseline_mses_windowwise
)

# df to store all results
all_mses_working = baseline_mses_working.to_frame("mse_baseline")

# average mse across days
print("Average MSE for each drive:\n\n")
display(baseline_mses_working.mean(level=0))
Average MSE for each drive:
serial_number
S300VKRB    0.0
S300VLKS    0.0
S300WE24    0.0
S300WE6K    0.0
S300WEAV    0.0
           ... 
ZHZ656DW    0.0
ZHZ65C16    0.0
ZJV009E3    0.0
ZJV00AR3    0.0
ZR50121P    NaN
Length: 136, dtype: float64
[21]
# what is the WORST average mse using baseline strategy
baseline_mses_working.mean(level=0).max()
0.040697674418604654

Results

From the above results, it seems that the baseline model works extremely well for healthy drives (MSE = 0 for most of them). The graphs in the section above indicated the same thing: there is barely any unexpected change in these SMART metrics while a drive is healthy. Therefore, in the following sections we will evaluate models only on failed-drive data, since any model would likely be decently performant (if not perfect) for healthy drives.

Linear Regression

[22]
# calculate windowwise mse for ols model for each drive
def get_linreg_mses_windowwise(drive_ts):
    # define model
    model = LinearRegression(
        normalize=True,
    )

    # create dummy exogenous variable (train)
    dummy_xtrain = np.arange(start=0, stop=NDAYS_DATA).reshape(-1, 1)

    mses = pd.Series(index=drive_ts.index)
    for end_idx in range(NDAYS_DATA - 1, len(drive_ts) - 1):
        # create dummy exogenous variable (test)
        dummy_xtest = np.arange(
            start=NDAYS_DATA,
            stop=NDAYS_DATA + min(NDAYS_TO_PREDICT, len(drive_ts) - end_idx - 1),
        ).reshape(-1, 1)

        # fit model
        model.fit(
            X=dummy_xtrain,
            y=drive_ts[[Y_COLS[0]]].iloc[end_idx + 1 - NDAYS_DATA : end_idx + 1],
        )

        # predict and get mse
        preds = model.predict(dummy_xtest).flatten()
        mses.iloc[end_idx] = (
            (
                drive_ts[Y_COLS[0]].iloc[end_idx + 1 : end_idx + 1 + NDAYS_TO_PREDICT]
                - preds
            )
            ** 2
        ).mean()
    return mses


windowwise_mses_linear = failed_ts_df.groupby("serial_number").apply(
    get_linreg_mses_windowwise
)
[40]
def calc_improvement(all_mses, windowwise_mses, mse_algo):
    # update all results df and compare
    updated_df = all_mses.assign(algo=windowwise_mses)
    updated_df = updated_df.rename({"algo": mse_algo}, axis="columns")

    # average mse across days
    mean_mses = updated_df[["mse_baseline", mse_algo]].groupby("serial_number").mean()

    # improvement over baseline
    diff = mean_mses[mse_algo] - mean_mses["mse_baseline"]
    mean_mses = mean_mses.reindex(diff.sort_values().index)
    print("Avg. MSE for each drive:")
    display(mean_mses)

    # average overall improvement over baseline
    print("Avg. Improvement =", -diff.mean())

    # peek at drive with most improvement
    print("\n peek at drive with most improvement")
    display(updated_df.loc[mean_mses.iloc[0].name])


calc_improvement(all_mses, windowwise_mses_linear, "linreg_mse")
Avg. MSE for each drive:
mse_baseline linreg_mse
serial_number
ZCH0EA28 381.806250 148.840343
ZCH0CCSY 736.158333 612.876751
ZJV5M9KA 460.544444 378.915628
ZCH074SE 169.000000 157.084444
ZA1819E9 11.441667 4.145687
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 1.5228603706772657

peek at drive with most improvement
mse_baseline linreg_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
2020-07-06 958.666667 57.512789
2020-07-07 973.333333 59.737506
2020-07-08 479.833333 227.104444
2020-07-09 122.200000 473.360000
2020-07-10 152.750000 274.646825
2020-07-11 203.666667 92.501769
2020-07-12 148.000000 1.859410
2020-07-13 16.000000 4.000000
2020-07-14 NaN NaN

Lasso

[41]
# calculate windowwise mse for lasso model for each drive
def get_lasso_mses_windowwise(drive_ts):
    # define model
    model = Lasso(
        normalize=True,
        random_state=42,
    )

    # create dummy exogenous variable (train)
    dummy_xtrain = np.arange(start=0, stop=NDAYS_DATA).reshape(-1, 1)

    mses = pd.Series(index=drive_ts.index)
    for end_idx in range(NDAYS_DATA - 1, len(drive_ts) - 1):
        # create dummy exogenous variable (test)
        dummy_xtest = np.arange(
            start=NDAYS_DATA,
            stop=NDAYS_DATA + min(NDAYS_TO_PREDICT, len(drive_ts) - end_idx - 1),
        ).reshape(-1, 1)

        # fit model
        model.fit(
            X=dummy_xtrain,
            y=drive_ts[[Y_COLS[0]]].iloc[end_idx + 1 - NDAYS_DATA : end_idx + 1],
        )

        # predict and get mse
        preds = model.predict(dummy_xtest).flatten()
        mses.iloc[end_idx] = (
            (
                drive_ts[Y_COLS[0]].iloc[end_idx + 1 : end_idx + 1 + NDAYS_TO_PREDICT]
                - preds
            )
            ** 2
        ).mean()
    return mses


windowwise_mses_lasso = failed_ts_df.groupby("serial_number").apply(
    get_lasso_mses_windowwise
)
[42]
calc_improvement(all_mses, windowwise_mses_lasso, "lasso_mse")
Avg. MSE for each drive:
mse_baseline lasso_mse
serial_number
ZCH0EA28 381.806250 89.382469
ZJV5M9KA 460.544444 417.703068
ZCH0CCSY 736.158333 707.688888
S300Z4TZ 0.031410 0.031410
ZCH083F6 0.000000 0.000000
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.1446787476441166

peek at drive with most improvement
mse_baseline lasso_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
2020-07-06 958.666667 209.017283
2020-07-07 973.333333 82.382696
2020-07-08 479.833333 64.721416
2020-07-09 122.200000 188.592501
2020-07-10 152.750000 86.134941
2020-07-11 203.666667 11.390482
2020-07-12 148.000000 23.540592
2020-07-13 16.000000 49.279841
2020-07-14 NaN NaN

Bayesian Ridge Regression

[43]
# calculate windowwise mse for br model for each drive
def get_br_mses_windowwise(drive_ts):
    # define model
    model = BayesianRidge(
        normalize=True,
    )

    # create dummy exogenous variable (train)
    dummy_xtrain = np.arange(start=0, stop=NDAYS_DATA).reshape(-1, 1)

    mses = pd.Series(index=drive_ts.index)
    for end_idx in range(NDAYS_DATA - 1, len(drive_ts) - 1):

        # create dummy exogenous variable (test)
        dummy_xtest = np.arange(
            start=NDAYS_DATA,
            stop=NDAYS_DATA + min(NDAYS_TO_PREDICT, len(drive_ts) - end_idx - 1),
        ).reshape(-1, 1)

        # fit model
        model.fit(
            X=dummy_xtrain,
            y=drive_ts[Y_COLS[0]].iloc[end_idx + 1 - NDAYS_DATA : end_idx + 1],
        )

        # predict and get mse
        preds = model.predict(dummy_xtest).flatten()

        # skip windows where the model produced invalid predictions
        if np.isnan(preds).any():
            mses.iloc[end_idx] = np.nan
            continue

        mses.iloc[end_idx] = (
            (
                drive_ts[Y_COLS[0]].iloc[end_idx + 1 : end_idx + 1 + NDAYS_TO_PREDICT]
                - preds
            )
            ** 2
        ).mean()
    return mses


windowwise_mses_br = failed_ts_df.groupby("serial_number").apply(get_br_mses_windowwise)
[44]
calc_improvement(all_mses, windowwise_mses_br, "br_mse")
Avg. MSE for each drive:
mse_baseline br_mse
serial_number
ZCH0EA28 381.806250 142.547238
ZCH0CCSY 736.158333 635.306334
ZJV5M9KA 460.544444 400.676324
ZA1819E9 11.441667 4.100040
ZCH074SE 169.000000 162.721665
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 1.364011532336577

peek at drive with most improvement
mse_baseline br_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
2020-07-06 958.666667 66.331898
2020-07-07 973.333333 57.570863
2020-07-08 479.833333 219.038516
2020-07-09 122.200000 457.571010
2020-07-10 152.750000 258.313124
2020-07-11 203.666667 73.581766
2020-07-12 148.000000 0.948216
2020-07-13 16.000000 7.022507
2020-07-14 NaN NaN

ARIMA

[47]
# calculate windowwise mse for arima model for each drive
def get_arima_mses_windowwise(drive_ts):
    mses = pd.Series(index=drive_ts.index)

    for end_idx in range(NDAYS_DATA - 1, len(drive_ts)):
        try:
            # init model and train
            model = ARIMA(
                endog=drive_ts[[Y_COLS[0]]].iloc[
                    end_idx + 1 - NDAYS_DATA : end_idx + 1
                ],
                order=(1, 1, 0),
                freq="D",
            )
            model = model.fit()

            # forecast and calculate error
            preds = model.forecast(steps=NDAYS_TO_PREDICT)
            mses.iloc[end_idx] = (
                (
                    drive_ts[Y_COLS[0]].iloc[
                        end_idx + 1 : end_idx + 1 + NDAYS_TO_PREDICT
                    ]
                    - preds
                )
                ** 2
            ).mean()

        # value errors occur when dates are not continuous
        except ValueError:
            mses.iloc[end_idx] = np.nan

    return mses


windowwise_mses_arima = failed_ts_df.groupby("serial_number").apply(
    get_arima_mses_windowwise
)
[48]
calc_improvement(all_mses, windowwise_mses_arima, "arima_mse")
Avg. MSE for each drive:
mse_baseline arima_mse
serial_number
ZCH0EA28 381.806250 267.692262
ZCH0CCSY 736.158333 661.323586
ZCH074SE 169.000000 145.534586
ZA12RB44 4.000000 0.133154
ZA16DDTZ 9.000000 5.670446
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.7232415431034219

peek at drive with most improvement
mse_baseline arima_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
2020-07-06 958.666667 628.750569
2020-07-07 973.333333 524.365544
2020-07-08 479.833333 48.255805
2020-07-09 122.200000 530.488499
2020-07-10 152.750000 152.750000
2020-07-11 203.666667 203.666667
2020-07-12 148.000000 52.187715
2020-07-13 16.000000 1.073299
2020-07-14 NaN NaN

SARIMAX

[49]
# calculate windowwise mse for sarimax model for each drive
def get_sarimax_mses_windowwise(drive_ts):
    mses = pd.Series(index=drive_ts.index)

    for end_idx in range(NDAYS_DATA - 1, len(drive_ts)):
        try:
            model = SARIMAX(
                endog=drive_ts[[Y_COLS[0]]].iloc[
                    end_idx + 1 - NDAYS_DATA : end_idx + 1
                ],
                order=(1, 1, 0),
                freq="D",
            )
            model = model.fit()
            preds = model.forecast(steps=NDAYS_TO_PREDICT)
            mses.iloc[end_idx] = (
                (
                    drive_ts[Y_COLS[0]].iloc[
                        end_idx + 1 : end_idx + 1 + NDAYS_TO_PREDICT
                    ]
                    - preds
                )
                ** 2
            ).mean()
        except ValueError:
            mses.iloc[end_idx] = np.nan
    return mses


windowwise_mses_sarimax = failed_ts_df.groupby("serial_number").apply(
    get_sarimax_mses_windowwise
)
[50]
calc_improvement(all_mses, windowwise_mses_sarimax, "sarimax_mse")
Avg. MSE for each drive:
mse_baseline sarimax_mse
serial_number
ZCH0EA28 381.806250 267.692262
ZCH0CCSY 736.158333 661.323586
ZCH074SE 169.000000 145.534586
ZA12RB44 4.000000 0.133154
ZA16DDTZ 9.000000 5.670446
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.7232415431034219

peek at drive with most improvement
mse_baseline sarimax_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
2020-07-06 958.666667 628.750569
2020-07-07 973.333333 524.365544
2020-07-08 479.833333 48.255805
2020-07-09 122.200000 530.488499
2020-07-10 152.750000 152.750000
2020-07-11 203.666667 203.666667
2020-07-12 148.000000 52.187715
2020-07-13 16.000000 1.073299
2020-07-14 NaN NaN

Prophet

[51]
# calculate windowwise mse for prophet model for each drive
def get_prophet_mses_windowwise(drive_ts):
    mses = pd.Series(index=drive_ts.index)
    for end_idx in range(NDAYS_DATA - 1, len(drive_ts) - 1):
        try:
            # init prophet model
            model = Prophet(
                n_changepoints=3,
                daily_seasonality=True,
                weekly_seasonality=False,
                yearly_seasonality=False,
                uncertainty_samples=False,
            )

            # fit to historical data (the NDAYS_DATA days up to and including today)
            model.fit(
                drive_ts[Y_COLS]
                .iloc[end_idx + 1 - NDAYS_DATA : end_idx + 1]
                .reset_index()
                .rename(columns={"date": "ds", f"{Y_COLS[0]}": "y"})
            )

            # predict future
            preds = model.predict(
                model.make_future_dataframe(
                    periods=min(NDAYS_TO_PREDICT, len(drive_ts) - end_idx - 1),
                    include_history=False,
                )
            )["yhat"].values

            # calculate mean squared error at this date
            mses.iloc[end_idx] = (
                (
                    drive_ts[Y_COLS[0]]
                    .iloc[end_idx + 1 : end_idx + 1 + NDAYS_TO_PREDICT]
                    .values
                    - preds
                )
                ** 2
            ).mean()
        except ValueError:
            mses.iloc[end_idx] = np.nan
    return mses


windowwise_mses_prophet = failed_ts_df.groupby("serial_number").apply(
    get_prophet_mses_windowwise
)
[52]
calc_improvement(all_mses, windowwise_mses_prophet, "prophet_mse")
Avg. MSE for each drive:
mse_baseline prophet_mse
serial_number
ZCH0EA28 381.806250 148.774488
ZCH0CCSY 736.158333 591.042024
ZJV5M9KA 460.544444 376.511968
ZA180YNM 62.883333 50.712123
ZCH074SE 169.000000 157.133496
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 1.695440886904843

peek at drive with most improvement
mse_baseline prophet_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
2020-07-06 958.666667 57.550052
2020-07-07 973.333333 59.718423
2020-07-08 479.833333 226.997537
2020-07-09 122.200000 473.553431
2020-07-10 152.750000 274.366472
2020-07-11 203.666667 92.204758
2020-07-12 148.000000 1.774653
2020-07-13 16.000000 4.030581
2020-07-14 NaN NaN

From the above experiments, it seems the Prophet model outperforms the rest for the SMART 5 metric. However, a simple OLS model works quite well too, with an average MSE not far from Prophet's. Nonetheless, Prophet has the advantage of providing confidence intervals, and it can also account for daily/weekly/yearly seasonality. Therefore, it is likely that when the amount of historical data available at runtime is increased, Prophet would outperform OLS by an even wider margin.
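As an aside, here is a minimal sketch of those confidence intervals. Note that the evaluation loops above set uncertainty_samples=False for speed, which disables interval estimation, so this snippet is illustrative only; the 6-day history below is made up.

# illustrative only: a prophet forecast with uncertainty intervals
history = pd.DataFrame(
    {
        "ds": pd.date_range("2020-07-01", periods=NDAYS_DATA, freq="D"),
        "y": [0.0, 0.0, 1.0, 3.0, 4.0, 9.0],  # made-up SMART values
    }
)

model = Prophet(
    n_changepoints=3,
    daily_seasonality=True,
    weekly_seasonality=False,
    yearly_seasonality=False,
    interval_width=0.80,  # width of the returned uncertainty interval
)
model.fit(history)

future = model.make_future_dataframe(periods=NDAYS_TO_PREDICT, include_history=False)
forecast = model.predict(future)

# yhat_lower and yhat_upper bound each forecasted value
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]])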

Metric SMART 187

In the following cells, we will predict the future values of the SMART 187 Metric. SMART 187 specifies the count of errors that could not be recovered using hardware ECC.

[53]
# training setup config
Y_COLS = ["smart_187_normalized"]
y_smart_stats = set([col.split("_")[1] for col in Y_COLS])

X_COLS = [c for c in smart_cols if c.split("_")[1] not in y_smart_stats]

NDAYS_DATA = 6
NDAYS_TO_PREDICT = 6
[54]
# calculate baseline
baseline_mses = failed_ts_df.groupby("serial_number").apply(
    get_baseline_mses_windowwise
)

# df to store all results
all_mses = baseline_mses.to_frame("mse_baseline")

# average mse across days
print("Average MSE for each drive:\n\n", baseline_mses.mean(level=0), end="\n\n")
Average MSE for each drive:

serial_number
S300WFGR            NaN
S300Z4TZ      32.910256
S300Z7P1       2.544231
S301FDQW     257.093750
S301GMWQ      24.436364
                ...
ZLW0G6CE    1263.813158
ZLW0G6FJ            NaN
ZLW0GK7E       0.000000
ZLW0GKE4       0.000000
ZLW0GPC5            NaN
Length: 272, dtype: float64
[55]
visualize_mses(baseline_mses, 5)

Working Drives

[56]
# calculate baseline for healthy drives
baseline_mses_working = working_ts_df.groupby("serial_number").apply(
    get_baseline_mses_windowwise
)

# df to store all results
all_mses_working = baseline_mses_working.to_frame("mse_baseline")

# average mse across days
print("Average MSE for each drive:\n\n")
display(baseline_mses_working.mean(level=0))
Average MSE for each drive:
serial_number
S300VKRB    0.0
S300VLKS    0.0
S300WE24    0.0
S300WE6K    0.0
S300WEAV    0.0
           ... 
ZHZ656DW    0.0
ZHZ65C16    0.0
ZJV009E3    0.0
ZJV00AR3    0.0
ZR50121P    NaN
Length: 136, dtype: float64
[57]
# what is the WORST average mse using baseline strategy
baseline_mses_working.mean(level=0).max()
3.7151162790697674

Linear Regression

[58]
windowwise_mses_linear = failed_ts_df.groupby("serial_number").apply(
    get_linreg_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_linear, "linreg_mse")
Avg. MSE for each drive:
mse_baseline linreg_mse
serial_number
Z3052KFM 2104.193750 1904.334640
ZA180YNM 927.082456 835.112770
Z304JL8D 261.437879 174.055424
Z302SYFG 163.941975 84.107804
ZA180971 185.429167 145.494930
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = -13.544770957108467

peek at drive with most improvement
mse_baseline linreg_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
2020-07-06 1958.333333 1871.334785
2020-07-07 2922.500000 2779.653061
2020-07-08 4015.166667 3609.422313
2020-07-09 3723.800000 3170.509751
2020-07-10 3868.750000 2944.942063
2020-07-11 201.000000 180.624308
2020-07-12 144.000000 136.853061
2020-07-13 0.000000 541.337778
2020-07-14 NaN NaN

Lasso

[59]
windowwise_mses_lasso = failed_ts_df.groupby("serial_number").apply(
    get_lasso_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_lasso, "lasso_mse")
Avg. MSE for each drive:
mse_baseline lasso_mse
serial_number
Z304JL8D 261.437879 205.298409
Z302SYFG 163.941975 161.501765
ZCH07RST 112.985526 111.428219
ZA1815CT 0.000000 0.000000
ZCH05TR4 0.000000 0.000000
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = -16.399749054351826

peek at drive with most improvement
mse_baseline lasso_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-26 90.75 455.673442
2020-08-27 121.00 57.964244
2020-08-28 0.00 300.594005
2020-08-29 0.00 337.825865
2020-08-30 NaN NaN

61 rows × 2 columns

Bayesian Ridge Regression

[60]
windowwise_mses_br = failed_ts_df.groupby("serial_number").apply(get_br_mses_windowwise)
calc_improvement(all_mses, windowwise_mses_br, "br_mse")
Avg. MSE for each drive:
mse_baseline br_mse
serial_number
Z3052KFM 2104.193750 1945.070187
Z304JL8D 261.437879 188.325664
Z302SYFG 163.941975 97.107672
ZA180YNM 927.082456 864.843916
ZA180971 185.429167 155.876973
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = -12.574399183281896

peek at drive with most improvement
mse_baseline br_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
2020-07-06 1958.333333 1881.106322
2020-07-07 2922.500000 2843.440680
2020-07-08 4015.166667 3671.497463
2020-07-09 3723.800000 3270.626857
2020-07-10 3868.750000 3001.388307
2020-07-11 201.000000 316.236658
2020-07-12 144.000000 94.783536
2020-07-13 0.000000 481.481670
2020-07-14 NaN NaN

ARIMA

[61]
windowwise_mses_arima = failed_ts_df.groupby("serial_number").apply(
    get_arima_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_arima, "arima_mse")
Avg. MSE for each drive:
mse_baseline arima_mse
serial_number
ZJV024DY 418.429060 158.945946
Z3052KFM 2104.193750 2000.392801
ZA1752L5 884.004545 823.868182
ZA180YNM 927.082456 872.846928
ZCH0CCSY 219.554762 186.018570
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 1.3287091876332966

peek at drive with most improvement
mse_baseline arima_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-09-19 7351.0 NaN
2020-09-20 9604.0 NaN
2020-09-21 0.0 NaN
2020-09-22 0.0 NaN
2020-09-23 NaN NaN

84 rows × 2 columns

SARIMAX

[62]
windowwise_mses_sarimax = failed_ts_df.groupby("serial_number").apply(
    get_sarimax_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_sarimax, "sarimax_mse")
Avg. MSE for each drive:
mse_baseline sarimax_mse
serial_number
ZJV024DY 418.429060 158.945946
Z3052KFM 2104.193750 2000.392801
ZA1752L5 884.004545 823.868182
ZA180YNM 927.082456 872.846928
ZCH0CCSY 219.554762 186.018570
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 1.3287091876332966

peek at drive with most improvement
mse_baseline sarimax_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-09-19 7351.0 NaN
2020-09-20 9604.0 NaN
2020-09-21 0.0 NaN
2020-09-22 0.0 NaN
2020-09-23 NaN NaN

84 rows × 2 columns

Prophet

[63]
windowwise_mses_prophet = failed_ts_df.groupby("serial_number").apply(
    get_prophet_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_prophet, "prophet_mse")
Avg. MSE for each drive:
mse_baseline prophet_mse
serial_number
Z3052KFM 2104.193750 1791.542133
ZA180YNM 927.082456 835.242244
Z304JL8D 261.437879 178.431901
Z302SYFG 163.941975 84.363744
ZA12CASG 119.248765 81.321330
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = -12.634035302462422

peek at drive with most improvement
mse_baseline prophet_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
2020-07-06 1958.333333 1958.333333
2020-07-07 2922.500000 2474.733351
2020-07-08 4015.166667 2929.241579
2020-07-09 3723.800000 3161.276150
2020-07-10 3868.750000 2946.888588
2020-07-11 201.000000 183.080278
2020-07-12 144.000000 136.394795
2020-07-13 0.000000 542.388985
2020-07-14 NaN NaN

From the above experiments, it seems the ARIMA and SARIMAX models perform best for the SMART 187 metric. Linear regression, lasso, Bayesian ridge regression, and Prophet all performed worse than the baseline model.

Metric SMART 188

In the following cells, we will predict the future values of the SMART 188 Metric. SMART 188 specifies the count of aborted operations due to HDD timeout. Normally this attribute value should be equal to zero.

[64]
# training setup config
Y_COLS = ["smart_188_normalized"]
y_smart_stats = set([col.split("_")[1] for col in Y_COLS])

X_COLS = [c for c in smart_cols if c.split("_")[1] not in y_smart_stats]

NDAYS_DATA = 6
NDAYS_TO_PREDICT = 6
[65]
# calculate baseline
baseline_mses = failed_ts_df.groupby("serial_number").apply(
    get_baseline_mses_windowwise
)

# df to store all results
all_mses = baseline_mses.to_frame("mse_baseline")

# average mse across days
print("Average MSE for each drive:\n\n", baseline_mses.mean(level=0), end="\n\n")
Average MSE for each drive:

serial_number
S300WFGR    NaN
S300Z4TZ    0.0
S300Z7P1    0.0
S301FDQW    0.0
S301GMWQ    0.0
           ...
ZLW0G6CE    0.0
ZLW0G6FJ    NaN
ZLW0GK7E    0.0
ZLW0GKE4    0.0
ZLW0GPC5    NaN
Length: 272, dtype: float64
[66]
visualize_mses(baseline_mses, 5)

Working Drives

[67]
# calculate baseline for healthy drives
baseline_mses_working = working_ts_df.groupby("serial_number").apply(
    get_baseline_mses_windowwise
)

# df to store all results
all_mses_working = baseline_mses_working.to_frame("mse_baseline")

# average mse across days
print("Average MSE for each drive:\n\n")
display(baseline_mses_working.mean(level=0))
Average MSE for each drive:
serial_number
S300VKRB    0.0
S300VLKS    0.0
S300WE24    0.0
S300WE6K    0.0
S300WEAV    0.0
           ... 
ZHZ656DW    0.0
ZHZ65C16    0.0
ZJV009E3    0.0
ZJV00AR3    0.0
ZR50121P    NaN
Length: 136, dtype: float64
[68]
# what is the WORST average mse using baseline strategy
baseline_mses_working.mean(level=0).max()
0.0

Linear Regression

[69]
windowwise_mses_linear = failed_ts_df.groupby("serial_number").apply(
    get_linreg_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_linear, "linreg_mse")
Avg. MSE for each drive:
mse_baseline linreg_mse
serial_number
ZCH07G29 2.838418 2.235782
S300Z4TZ 0.000000 0.000000
ZCH088QQ 0.000000 0.000000
ZCH08AGH 0.000000 0.000000
ZCH08ASZ 0.000000 0.000000
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.002168270638062176

peek at drive with most improvement
mse_baseline linreg_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-30 16.000000 16.000000
2020-08-31 21.333333 21.333333
2020-09-01 32.000000 32.000000
2020-09-02 64.000000 28.444444
2020-09-03 NaN NaN

65 rows × 2 columns

Lasso

[70]
windowwise_mses_lasso = failed_ts_df.groupby("serial_number").apply(
    get_lasso_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_lasso, "lasso_mse")
Avg. MSE for each drive:
mse_baseline lasso_mse
serial_number
ZCH07G29 2.838418 1.783804
S300Z4TZ 0.000000 0.000000
ZCH088QQ 0.000000 0.000000
ZCH08AGH 0.000000 0.000000
ZCH08ASZ 0.000000 0.000000
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.00371839697081104

peek at drive with most improvement
mse_baseline lasso_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-30 16.000000 16.000000
2020-08-31 21.333333 21.333333
2020-09-01 32.000000 32.000000
2020-09-02 64.000000 1.777778
2020-09-03 NaN NaN

65 rows × 2 columns

Bayesian Ridge Regression

[71]
windowwise_mses_br = failed_ts_df.groupby("serial_number").apply(get_br_mses_windowwise)
calc_improvement(all_mses, windowwise_mses_br, "br_mse")
Avg. MSE for each drive:
mse_baseline br_mse
serial_number
ZCH07G29 2.838418 2.062212
S300Z4TZ 0.000000 0.000000
ZCH088QQ 0.000000 0.000000
ZCH08AGH 0.000000 0.000000
ZCH08ASZ 0.000000 0.000000
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.00288318573879761

peek at drive with most improvement
mse_baseline br_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-30 16.000000 16.000000
2020-08-31 21.333333 21.333333
2020-09-01 32.000000 32.000000
2020-09-02 64.000000 18.203816
2020-09-03 NaN NaN

65 rows × 2 columns

ARIMA

[72]
windowwise_mses_arima = failed_ts_df.groupby("serial_number").apply(
    get_arima_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_arima, "arima_mse")
Avg. MSE for each drive:
mse_baseline arima_mse
serial_number
S300Z4TZ 0.0 0.0
ZCH088QQ 0.0 0.0
ZCH08AGH 0.0 0.0
ZCH08ASZ 0.0 0.0
ZCH09CT2 0.0 0.0
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = -0.0006640456981720934

peek at drive with most improvement
mse_baseline arima_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-09-18 0.0 0.0
2020-09-19 0.0 0.0
2020-09-20 0.0 0.0
2020-09-21 0.0 0.0
2020-09-22 NaN NaN

84 rows × 2 columns

SARIMAX

[73]
windowwise_mses_sarimax = failed_ts_df.groupby("serial_number").apply(
    get_sarimax_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_sarimax, "sarimax_mse")
Avg. MSE for each drive:
mse_baseline sarimax_mse
serial_number
S300Z4TZ 0.0 0.0
ZCH088QQ 0.0 0.0
ZCH08AGH 0.0 0.0
ZCH08ASZ 0.0 0.0
ZCH09CT2 0.0 0.0
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = -0.0006640456981720934

peek at drive with most improvement
mse_baseline sarimax_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-09-18 0.0 0.0
2020-09-19 0.0 0.0
2020-09-20 0.0 0.0
2020-09-21 0.0 0.0
2020-09-22 NaN NaN

84 rows × 2 columns

Prophet

[74]
windowwise_mses_prophet = failed_ts_df.groupby("serial_number").apply(
    get_prophet_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_prophet, "prophet_mse")
Avg. MSE for each drive:
mse_baseline prophet_mse
serial_number
ZCH07G29 2.838418 2.231074
S300Z4TZ 0.000000 0.000000
ZCH088QQ 0.000000 0.000000
ZCH08AGH 0.000000 0.000000
ZCH08ASZ 0.000000 0.000000
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.0016292468309516053

peek at drive with most improvement
mse_baseline prophet_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-30 16.000000 16.000000
2020-08-31 21.333333 21.333333
2020-09-01 32.000000 32.000000
2020-09-02 64.000000 28.166686
2020-09-03 NaN NaN

65 rows × 2 columns

From the above experiments, we can see that for the SMART 188 metric, lasso performed a little better than the baseline model, followed by Bayesian ridge regression, linear regression, and Prophet. ARIMA and SARIMAX performed slightly worse than the baseline model.

Metric SMART 197

In the following cells, we will predict the future values of the SMART 197 Metric. SMART 197 specifies the count of "unstable" sectors (waiting to be remapped, because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased.

[75]
# training setup config
Y_COLS = ["smart_197_normalized"]
y_smart_stats = set([col.split("_")[1] for col in Y_COLS])

X_COLS = [c for c in smart_cols if c.split("_")[1] not in y_smart_stats]

NDAYS_DATA = 6
NDAYS_TO_PREDICT = 6
[76]
# calculate baseline
baseline_mses = failed_ts_df.groupby("serial_number").apply(
    get_baseline_mses_windowwise
)

# df to store all results
all_mses = baseline_mses.to_frame("mse_baseline")

# average mse across days
print("Average MSE for each drive:\n\n", baseline_mses.mean(level=0), end="\n\n")
Average MSE for each drive:

serial_number
S300WFGR           NaN
S300Z4TZ     41.323291
S300Z7P1      0.000000
S301FDQW    368.038690
S301GMWQ      0.000000
               ...
ZLW0G6CE      0.000000
ZLW0G6FJ           NaN
ZLW0GK7E      0.000000
ZLW0GKE4      0.000000
ZLW0GPC5           NaN
Length: 272, dtype: float64
[77]
visualize_mses(baseline_mses, 5)

Working Drives

[78]
# calculate baseline for healthy drives
baseline_mses_working = working_ts_df.groupby("serial_number").apply(
    get_baseline_mses_windowwise
)

# df to store all results
all_mses_working = baseline_mses_working.to_frame("mse_baseline")

# average mse across days
print("Average MSE for each drive:\n\n")
display(baseline_mses_working.mean(level=0))
Average MSE for each drive:
serial_number
S300VKRB    0.0
S300VLKS    0.0
S300WE24    0.0
S300WE6K    0.0
S300WEAV    0.0
           ... 
ZHZ656DW    0.0
ZHZ65C16    0.0
ZJV009E3    0.0
ZJV00AR3    0.0
ZR50121P    NaN
Length: 136, dtype: float64
[79]
# what is the WORST average mse using baseline strategy
baseline_mses_working.mean(level=0).max()
0.0

Linear Regression

[80]
windowwise_mses_linear = failed_ts_df.groupby("serial_number").apply(
    get_linreg_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_linear, "linreg_mse")
Avg. MSE for each drive:
mse_baseline linreg_mse
serial_number
ZCH07G29 293.335593 224.181284
S301GMY2 15.288542 13.293533
Z302SYFG 6.602469 5.297197
ZA1815CT 0.000000 0.000000
ZCH08ASZ 0.000000 0.000000
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.12337602853312885

peek at drive with most improvement
mse_baseline linreg_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-30 3223.0 2088.944444
2020-08-31 4168.0 2882.342404
2020-09-01 324.0 572.070975
2020-09-02 0.0 37.617778
2020-09-03 NaN NaN

65 rows × 2 columns

Lasso

[81]
windowwise_mses_lasso = failed_ts_df.groupby("serial_number").apply(
    get_lasso_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_lasso, "lasso_mse")
Avg. MSE for each drive:
mse_baseline lasso_mse
serial_number
ZA1815CT 0.0 0.0
ZCH088QQ 0.0 0.0
ZCH08AGH 0.0 0.0
ZCH08ASZ 0.0 0.0
ZCH09CT2 0.0 0.0
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = -0.5310304881984288

peek at drive with most improvement
mse_baseline lasso_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
2020-07-06 0.0 0.0
2020-07-07 0.0 0.0
2020-07-08 0.0 0.0
2020-07-09 0.0 0.0
2020-07-10 0.0 0.0
2020-07-11 0.0 0.0
2020-07-12 0.0 0.0
2020-07-13 0.0 0.0
2020-07-14 0.0 0.0
2020-07-15 0.0 0.0
2020-07-16 0.0 0.0
2020-07-17 0.0 0.0
2020-07-18 0.0 0.0
2020-07-19 0.0 0.0
2020-07-20 0.0 0.0
2020-07-21 0.0 0.0
2020-07-22 0.0 0.0
2020-07-23 0.0 0.0
2020-07-24 0.0 0.0
2020-07-25 0.0 0.0
2020-07-26 0.0 0.0
2020-07-27 0.0 0.0
2020-07-28 0.0 0.0
2020-07-29 0.0 0.0
2020-07-30 NaN NaN

Bayesian Ridge Regression

[82]
windowwise_mses_br = failed_ts_df.groupby("serial_number").apply(get_br_mses_windowwise)
calc_improvement(all_mses, windowwise_mses_br, "br_mse")
Avg. MSE for each drive:
mse_baseline br_mse
serial_number
ZCH07G29 293.335593 232.366355
S301GMY2 15.288542 14.023308
Z302SYFG 6.602469 5.667974
ZA1815CT 0.000000 0.000000
ZCH08ASZ 0.000000 0.000000
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.03253268716345691

peek at drive with most improvement
mse_baseline br_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-30 3223.0 2138.633687
2020-08-31 4168.0 2939.613471
2020-09-01 324.0 785.562136
2020-09-02 0.0 12.482166
2020-09-03 NaN NaN

65 rows × 2 columns

ARIMA

[83]
windowwise_mses_arima = failed_ts_df.groupby("serial_number").apply(
    get_arima_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_arima, "arima_mse")
Avg. MSE for each drive:
mse_baseline arima_mse
serial_number
S301FDQW 368.038690 343.190154
ZCH07G29 293.335593 271.199328
S301GMY2 15.288542 15.239172
S300Z4TZ 41.323291 41.323277
Z3052KFM 12.445833 12.445831
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.07091948293105635

peek at drive with most improvement
mse_baseline arima_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-27 4226.0 4225.999570
2020-08-28 5352.0 5130.782489
2020-08-29 2290.0 1346.610268
2020-08-30 324.0 97.089651
2020-08-31 NaN NaN

62 rows × 2 columns

SARIMAX

[84]
windowwise_mses_sarimax = failed_ts_df.groupby("serial_number").apply(
    get_sarimax_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_sarimax, "sarimax_mse")
Avg. MSE for each drive:
mse_baseline sarimax_mse
serial_number
S301FDQW 368.038690 343.190154
ZCH07G29 293.335593 271.199328
S301GMY2 15.288542 15.239172
S300Z4TZ 41.323291 41.323277
Z3052KFM 12.445833 12.445831
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.07091948293105635

peek at drive with most improvement
mse_baseline sarimax_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-27 4226.0 4225.999570
2020-08-28 5352.0 5130.782489
2020-08-29 2290.0 1346.610268
2020-08-30 324.0 97.089651
2020-08-31 NaN NaN

62 rows × 2 columns

Prophet

[85]
windowwise_mses_prophet = failed_ts_df.groupby("serial_number").apply(
    get_prophet_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_prophet, "prophet_mse")
Avg. MSE for each drive:
mse_baseline prophet_mse
serial_number
ZCH07G29 293.335593 212.247761
S301GMY2 15.288542 11.689198
S301FDQW 368.038690 364.956980
Z302SYFG 6.602469 5.582433
S300Z4TZ 41.323291 40.844045
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.22124409291941446

peek at drive with most improvement
mse_baseline prophet_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-30 3223.0 2090.197908
2020-08-31 4168.0 2882.460092
2020-09-01 324.0 574.886147
2020-09-02 0.0 36.757403
2020-09-03 NaN NaN

65 rows × 2 columns

From the above experiments, it seems the Prophet model outperforms the rest for the SMART 197 metric. However, a simple OLS model works quite well too, with an average MSE not far from that of Prophet.

Metric SMART 198

In the following cells, we will predict the future values of the SMART 198 Metric. SMART 198 specifies the total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.

[86]
# training setup config
Y_COLS = ["smart_198_normalized"]
y_smart_stats = set([col.split("_")[1] for col in Y_COLS])

X_COLS = [c for c in smart_cols if c.split("_")[1] not in y_smart_stats]

NDAYS_DATA = 6
NDAYS_TO_PREDICT = 6
[87]
# calculate baseline
baseline_mses = failed_ts_df.groupby("serial_number").apply(
    get_baseline_mses_windowwise
)

# df to store all results
all_mses = baseline_mses.to_frame("mse_baseline")

# average mse across days
print("Average MSE for each drive:\n\n", baseline_mses.mean(level=0), end="\n\n")
Average MSE for each drive:

serial_number
S300WFGR           NaN
S300Z4TZ     41.323291
S300Z7P1      0.000000
S301FDQW    368.038690
S301GMWQ      0.000000
               ...
ZLW0G6CE      0.000000
ZLW0G6FJ           NaN
ZLW0GK7E      0.000000
ZLW0GKE4      0.000000
ZLW0GPC5           NaN
Length: 272, dtype: float64
[88]
visualize_mses(baseline_mses, 5)
[Output: per-drive plots of the windowwise baseline MSEs for 5 sample drives; repeated matplotlib INFO messages omitted]
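
visualize_mses is also defined earlier in the notebook; an illustrative stand-in that plots the windowwise MSE series for the first n drives could look like this (assuming the MSE series has a (serial_number, date) MultiIndex, as produced by the groupby above):

# Hedged sketch of a per-drive MSE plot; plt is imported at the top.
def visualize_mses_sketch(mses, n):
    for serial in mses.index.get_level_values(0).unique()[:n]:
        drive_mses = mses.loc[serial].dropna()
        plt.figure(figsize=(10, 3))
        plt.plot(drive_mses.index, drive_mses.values)
        plt.title(f"Windowwise MSE for drive {serial}")
        plt.xlabel("date")
        plt.ylabel("MSE")
        plt.show()
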
[89]
# calculate baseline for healthy drives
baseline_mses_working = working_ts_df.groupby("serial_number").apply(
    get_baseline_mses_windowwise
)

# df to store all results
all_mses_working = baseline_mses_working.to_frame("mse_baseline")

# average mse across days
print("Average MSE for each drive:\n\n")
display(baseline_mses_working.mean(level=0))
Average MSE for each drive:
serial_number
S300VKRB    0.0
S300VLKS    0.0
S300WE24    0.0
S300WE6K    0.0
S300WEAV    0.0
           ... 
ZHZ656DW    0.0
ZHZ65C16    0.0
ZJV009E3    0.0
ZJV00AR3    0.0
ZR50121P    NaN
Length: 136, dtype: float64
[90]
# what is the WORST average mse using baseline strategy
baseline_mses_working.mean(level=0).max()
0.0

The worst baseline MSE across all healthy drives is 0.0, i.e. the baseline strategy already predicts SMART 198 perfectly for them; the interesting comparisons are therefore on the failed drives.

Linear Regression

[91]
windowwise_mses_linear = failed_ts_df.groupby("serial_number").apply(
    get_linreg_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_linear, "linreg_mse")
Avg. MSE for each drive:
mse_baseline linreg_mse
serial_number
ZCH07G29 293.335593 224.181284
S301GMY2 15.288542 13.293533
Z302SYFG 6.602469 5.297197
ZA1815CT 0.000000 0.000000
ZCH08ASZ 0.000000 0.000000
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.12337602853312885

Peek at drive with most improvement:
mse_baseline linreg_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-30 3223.0 2088.944444
2020-08-31 4168.0 2882.342404
2020-09-01 324.0 572.070975
2020-09-02 0.0 37.617778
2020-09-03 NaN NaN

65 rows × 2 columns
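
The helpers get_linreg_mses_windowwise, get_lasso_mses_windowwise, and get_br_mses_windowwise (all defined earlier, and used in this and the next two sections) share the same structure, with only the estimator changing. Given the X_COLS/Y_COLS split above, one plausible reading is that each window fits the target SMART metric on the drive's other SMART attributes; the sketch below assumes that setup and is not the notebook's definitive implementation.

# Hedged sketch of the shared per-window regression step; swap in
# Lasso() or BayesianRidge() (imported at the top) for the variants.
def sklearn_mse_window(train_df, test_df, estimator=None):
    est = estimator if estimator is not None else LinearRegression()
    est.fit(train_df[X_COLS], train_df[Y_COLS[0]])
    preds = est.predict(test_df[X_COLS])
    return np.mean((test_df[Y_COLS[0]].values - preds) ** 2)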

Lasso

[92]
windowwise_mses_lasso = failed_ts_df.groupby("serial_number").apply(
    get_lasso_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_lasso, "lasso_mse")
Avg. MSE for each drive:
mse_baseline lasso_mse
serial_number
ZA1815CT 0.0 0.0
ZCH088QQ 0.0 0.0
ZCH08AGH 0.0 0.0
ZCH08ASZ 0.0 0.0
ZCH09CT2 0.0 0.0
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = -0.5310304881984288

Note the negative value: on average, Lasso actually performs worse than the baseline for this metric.

Peek at drive with most improvement:
mse_baseline lasso_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
2020-07-06 0.0 0.0
2020-07-07 0.0 0.0
2020-07-08 0.0 0.0
2020-07-09 0.0 0.0
2020-07-10 0.0 0.0
2020-07-11 0.0 0.0
2020-07-12 0.0 0.0
2020-07-13 0.0 0.0
2020-07-14 0.0 0.0
2020-07-15 0.0 0.0
2020-07-16 0.0 0.0
2020-07-17 0.0 0.0
2020-07-18 0.0 0.0
2020-07-19 0.0 0.0
2020-07-20 0.0 0.0
2020-07-21 0.0 0.0
2020-07-22 0.0 0.0
2020-07-23 0.0 0.0
2020-07-24 0.0 0.0
2020-07-25 0.0 0.0
2020-07-26 0.0 0.0
2020-07-27 0.0 0.0
2020-07-28 0.0 0.0
2020-07-29 0.0 0.0
2020-07-30 NaN NaN

Bayesian Ridge Regression

[93]
windowwise_mses_br = failed_ts_df.groupby("serial_number").apply(get_br_mses_windowwise)
calc_improvement(all_mses, windowwise_mses_br, "br_mse")
Avg. MSE for each drive:
mse_baseline br_mse
serial_number
ZCH07G29 293.335593 232.366355
S301GMY2 15.288542 14.023308
Z302SYFG 6.602469 5.667974
ZA1815CT 0.000000 0.000000
ZCH08ASZ 0.000000 0.000000
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.03253268716345691

Peek at drive with most improvement:
mse_baseline br_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-30 3223.0 2138.633687
2020-08-31 4168.0 2939.613471
2020-09-01 324.0 785.562136
2020-09-02 0.0 12.482166
2020-09-03 NaN NaN

65 rows × 2 columns

ARIMA

[94]
windowwise_mses_arima = failed_ts_df.groupby("serial_number").apply(
    get_arima_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_arima, "arima_mse")
Avg. MSE for each drive:
mse_baseline arima_mse
serial_number
S301FDQW 368.038690 343.190154
ZCH07G29 293.335593 271.199328
S301GMY2 15.288542 15.239172
S300Z4TZ 41.323291 41.323277
Z3052KFM 12.445833 12.445831
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.07091948293105635

Peek at drive with most improvement:
mse_baseline arima_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-27 4226.0 4225.999570
2020-08-28 5352.0 5130.782489
2020-08-29 2290.0 1346.610268
2020-08-30 324.0 97.089651
2020-08-31 NaN NaN

62 rows × 2 columns

SARIMAX

[95]
windowwise_mses_sarimax = failed_ts_df.groupby("serial_number").apply(
    get_sarimax_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_sarimax, "sarimax_mse")
Avg. MSE for each drive:
mse_baseline sarimax_mse
serial_number
S301FDQW 368.038690 343.190154
ZCH07G29 293.335593 271.199328
S301GMY2 15.288542 15.239172
S300Z4TZ 41.323291 41.323277
Z3052KFM 12.445833 12.445831
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.07091948293105635

Peek at drive with most improvement:
mse_baseline sarimax_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-27 4226.0 4225.999570
2020-08-28 5352.0 5130.782489
2020-08-29 2290.0 1346.610268
2020-08-30 324.0 97.089651
2020-08-31 NaN NaN

62 rows × 2 columns

Prophet

[96]
windowwise_mses_prophet = failed_ts_df.groupby("serial_number").apply(
    get_prophet_mses_windowwise
)
calc_improvement(all_mses, windowwise_mses_prophet, "prophet_mse")
Avg. MSE for each drive:
mse_baseline prophet_mse
serial_number
ZCH07G29 293.335593 212.247761
S301GMY2 15.288542 11.689198
S301FDQW 368.038690 364.956980
Z302SYFG 6.602469 5.582433
S300Z4TZ 41.323291 40.844045
... ... ...
ZHZ50G0W NaN NaN
ZHZ5W6L4 NaN NaN
ZL005C4H NaN NaN
ZLW0G6FJ NaN NaN
ZLW0GPC5 NaN NaN

272 rows × 2 columns

Avg. Improvement = 0.22124409291941446

Peek at drive with most improvement:
mse_baseline prophet_mse
date
2020-07-01 NaN NaN
2020-07-02 NaN NaN
2020-07-03 NaN NaN
2020-07-04 NaN NaN
2020-07-05 NaN NaN
... ... ...
2020-08-30 3223.0 2090.197908
2020-08-31 4168.0 2882.460092
2020-09-01 324.0 574.886147
2020-09-02 0.0 36.757403
2020-09-03 NaN NaN

65 rows × 2 columns

From the above experiments, it seems that the Prophet model outperforms the rest for the SMART 198 metric as well. However, a simple OLS linear regression model also works quite well, with an average MSE not far from Prophet's.

Conclusion

In this notebook, we tried to predict future values of the SMART 5, 187, 188, 197, and 198 metrics using different machine learning models. In general, we found that the ML models give better results than the baseline. Prophet outperformed the other models for the SMART 5, 197, and 198 metrics. However, no single model beat the baseline across all the metrics, so we will select a different model for each metric.
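
As a concrete illustration of that per-metric selection, a downstream consumer could key models off the metric name as in the hypothetical map below; the entries for SMART 187 and 188 are placeholders, since only the winners for 5, 197, and 198 are stated above.

# Hypothetical per-metric model selection map (names are illustrative).
BEST_MODEL_PER_METRIC = {
    "smart_5_normalized": "prophet",
    "smart_187_normalized": None,  # placeholder: fill from its experiment
    "smart_188_normalized": None,  # placeholder: fill from its experiment
    "smart_197_normalized": "prophet",
    "smart_198_normalized": "prophet",
}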