This project consists of the following main workstreams:
- Disk Health Predictor Library
In this workstream, we address the primary objective of the project, i.e., creating disk failure prediction models using open source datasets. Specifically, we train classification models that classify a given disk's health into one of three categories:
good (>6 weeks till failure),
warning (2-6 weeks till failure), and
bad (<2 weeks till failure).
These disk health categories were defined this way to be consistent with the existing setup in Ceph provided by DiskProphet, a disk health prediction solution from ProphetStor. The input for these models is 6 days of SMART data, which was also a design choice made for compatibility with the existing setup.
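To make the category definitions concrete, here is a minimal labeling sketch. The function name and the use of days as the unit are illustrative; the thresholds simply translate the 2-week and 6-week boundaries described above.

```python
import pandas as pd

def label_disk_health(days_to_failure: float) -> str:
    """Map remaining lifetime (in days) to the three health categories
    used in this project (2-week and 6-week boundaries as described above)."""
    if days_to_failure > 42:      # more than 6 weeks till failure
        return "good"
    elif days_to_failure >= 14:   # 2 to 6 weeks till failure
        return "warning"
    else:                         # less than 2 weeks till failure
        return "bad"

# Example: label a few disks by their (known) remaining lifetime
lifetimes = pd.Series([120, 30, 5], index=["disk_a", "disk_b", "disk_c"])
labels = lifetimes.map(label_disk_health)
print(labels.tolist())  # ['good', 'warning', 'bad']
```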
The data used for training the models is the Backblaze dataset. It consists of SMART data collected daily from the hard disk drives in the Backblaze datacenter, along with a label indicating whether or not the drive was considered failed.
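The shape of this data can be sketched as follows. The rows below are a tiny in-memory sample mimicking the Backblaze daily snapshot schema (real files are CSVs with one row per drive per day and many more smart_* columns; the serial numbers and values here are made up).

```python
import pandas as pd

# Tiny in-memory sample mimicking the Backblaze daily snapshot schema
# (serial numbers and SMART values below are made up for illustration)
df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01"],
    "serial_number": ["ZA001", "ZA002"],
    "model": ["ST12000NM0007", "ST12000NM0007"],
    "failure": [0, 1],          # 1 only on the day the drive failed
    "smart_5_raw": [0, 16],     # reallocated sectors count
    "smart_187_raw": [0, 3],    # reported uncorrectable errors
})

# Select identifying columns, the failure label, and all SMART metric columns
smart_cols = [c for c in df.columns if c.startswith("smart_")]
features = df[["date", "serial_number", "model", "failure"] + smart_cols]

# Drives labeled as failed in this snapshot
failed = features[features["failure"] == 1]
print(failed["serial_number"].tolist())  # ['ZA002']
```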
The following notebooks were created as a part of this workstream:
smartctl_json_db_format_finder: Understand the structure of the JSON output produced by smartctl.
smartctl_json_db_to_df: Convert a nested smartctl JSON to a pandas dataframe.
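The conversion in that notebook can be sketched with `pandas.json_normalize`. The excerpt below is a minimal slice of `smartctl --json` output (real output contains many more fields); flattening its nested attribute table yields one dataframe row per SMART attribute.

```python
import pandas as pd

# Minimal excerpt of `smartctl --json` output (real output has many more fields)
smartctl_json = {
    "model_name": "ST12000NM0007",
    "serial_number": "ZA001",
    "ata_smart_attributes": {
        "table": [
            {"id": 5, "name": "Reallocated_Sector_Ct", "value": 100,
             "raw": {"value": 0}},
            {"id": 187, "name": "Reported_Uncorrect", "value": 98,
             "raw": {"value": 3}},
        ]
    },
}

# Flatten the nested attribute table: nested keys like raw["value"]
# become dotted column names such as "raw.value"
attrs = pd.json_normalize(smartctl_json["ata_smart_attributes"]["table"])
attrs["serial_number"] = smartctl_json["serial_number"]

print(attrs[["serial_number", "id", "name", "raw.value"]])
```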
data_explorer: Explore the contents and salient properties of the Backblaze dataset.
data_cleaner_seagate: Clean the data available for Seagate disks.
data_cleaner_hgst: Clean the data available for HGST disks.
clustering_and_binaryclf: Explore clustering models and binary pass/fail classifiers.
ternary_clf: Explore ternary classifiers, i.e. models that classify disk health into “good”, “warning”, and “bad” as described above.
kaggle_seagate_end2end: Entire ML pipeline, from data cleaning through feature engineering to model training, for Seagate disks. Combines the results from the notebooks in the above sections.
kaggle_hgst_end2end: Entire ML pipeline, from data cleaning through feature engineering to model training, for HGST disks. Combines the results from the notebooks in the above sections.
- SMART Metric Forecasting
The goal of this workstream is to create models that can forecast the values of individual SMART metrics into the near future. The idea is that these forecasting models could be used in tandem with the disk health classifier models from above. Together, they can provide more granular and detailed insight into which specific component of a given disk is likely to fail. Based on this information, the storage system operator or subject matter expert can decide whether or not to remove a hard disk drive from the storage cluster or datacenter, according to their own failure tolerance level.
In the initial setup, we treat each SMART metric as an independent variable to forecast. That is, we train univariate forecasting models for each (significant) SMART metric.
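The shape of this problem can be sketched with a deliberately simple univariate forecaster: fit a linear trend to one metric's recent history and extrapolate. The values and the choice of a linear model are illustrative only; the notebooks may use richer time-series models, but the interface is the same: one series in, future values of that same series out.

```python
import numpy as np

# Six days of one SMART metric for one disk (values are made up)
history = np.array([10., 10., 12., 13., 15., 16.])  # e.g. smart_187_raw

# Minimal univariate forecaster: fit a linear trend and extrapolate.
days = np.arange(len(history))
slope, intercept = np.polyfit(days, history, deg=1)

# Forecast the next 3 days of the same metric
horizon = np.arange(len(history), len(history) + 3)
forecast = slope * horizon + intercept
print(np.round(forecast, 1))
```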
In a subsequent setup, we take into account the interactions between the various SMART metrics. That is, we train multivariate forecasting models to predict how all the SMART metric values will change together in the near future.
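A minimal sketch of the multivariate version is a first-order vector autoregression, x_t ≈ A·x_{t-1} + b, fit by least squares. The two metrics and their values below are made up; the point is that the model forecasts all metrics jointly, so correlations between them are captured in the coefficient matrix.

```python
import numpy as np

# Joint history of two SMART metrics over 6 days (made-up values)
X = np.array([
    [10., 100.],
    [10.,  99.],
    [12.,  99.],
    [13.,  98.],
    [15.,  97.],
    [16.,  96.],
])

# First-order vector autoregression: x_t ~ A @ x_{t-1} + b, fit by least squares.
past, future = X[:-1], X[1:]
# Append a bias column so the intercept b is learned alongside A.
P = np.hstack([past, np.ones((len(past), 1))])
coef, *_ = np.linalg.lstsq(P, future, rcond=None)

# One-step-ahead joint forecast for both metrics at once
next_step = np.append(X[-1], 1.0) @ coef
print(np.round(next_step, 1))
```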
- Additional Data Sources
Most of the work done in this project is based on the Backblaze dataset, since it was the only large, publicly available, and well-curated dataset at the time. However, since this data is collected from just one company with a specific usage pattern (backup), it might not capture the varied usage patterns of real Ceph users. Fortunately, there have been other recent efforts in academia and industry towards collecting more detailed disk health data, which could help create better disk failure prediction models. Of these data sources, we investigate the following two in this workstream.
The Ceph team has been collecting anonymized SMART metrics from their users. We have worked with them to make this data publicly available. Furthermore, we have created an exploratory notebook that walks you through accessing this data, highlights the main features of this data, and compares it with the Backblaze dataset.
Recent research suggests that combining disk performance and disk location data with SMART metrics can be valuable in analyzing disk health. Specifically, this paper reports improvements in disk failure prediction models when these additional features are used. In this effort, we explore the FAST dataset and evaluate the tradeoff between the gain in model performance and the overhead of collecting additional metrics from users.
- Exploratory notebook (forthcoming)
- Open Source Python Module
The goal of this workstream is to package the models trained in this project into an open source python module, making them easily accessible and usable by everyone, not just Ceph users. This way, anyone can run inference on their own storage system using the models built in this project.
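The intended usage could look something like the sketch below. The class name, method name, and the placeholder prediction rule are all hypothetical, not the published API; a real predictor would run the trained classifier on the 6-day SMART window described above.

```python
import pandas as pd

class DiskHealthPredictor:
    """Hypothetical stand-in for the packaged model: takes 6 days of SMART
    data for one disk and returns 'good', 'warning', or 'bad'."""
    def predict(self, smart_df: pd.DataFrame) -> str:
        # A real model would run the trained classifier here; this
        # placeholder just flags a growing reallocated-sector count.
        return "bad" if smart_df["smart_5_raw"].iloc[-1] > 0 else "good"

# Six days of SMART data for one disk (made-up values)
six_days = pd.DataFrame({"smart_5_raw": [0, 0, 0, 1, 2, 4]})
print(DiskHealthPredictor().predict(six_days))  # bad
```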