Ceph Telemetry Data EDA
The Ceph team recently started collecting disk health data contributed by its users. This notebook shows how to access that data and explores the contents of the dataset at a high level. It also contrasts this telemetry dataset with the previously explored BackBlaze data. The data is available here.
import wget
import requests
import json
from tqdm import tqdm
import zipfile
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
sns.set(rc={"figure.figsize": (11.7, 8.27)})
Loading the data
First, we'll look at the different data sets we are able to download.
url = "https://kzn-swift.massopen.cloud/swift/v1/devicehealth/"
r = requests.get(url, allow_redirects=True)
files = r.text.split(sep="\n")
files
['device_health_metrics_2020-01.zip',
'device_health_metrics_2020-02.zip',
'device_health_metrics_2020-03.zip',
'device_health_metrics_2020-04.zip',
'device_health_metrics_2020-05.zip',
'device_health_metrics_2020-06.zip',
'device_health_metrics_2020-07.zip',
'device_health_metrics_2020-08.zip',
'device_health_metrics_2020-09.zip',
'device_health_metrics_2020-10.zip',
'device_health_metrics_2020-11.zip',
'device_health_metrics_2020-12.zip',
'device_health_metrics_2021-01.zip',
'device_health_metrics_2021-02.zip',
'device_health_metrics_2021-03.zip',
'device_health_metrics_2021-04.zip',
'device_health_metrics_2021-05.zip',
'device_health_metrics_2021-06.zip']
file_name = "device_health_metrics_2020-01"
url = url + f"{file_name}.zip"
zipped_data = wget.download(url)
with zipfile.ZipFile(f"{file_name}.zip", "r") as zip_ref:
    zip_ref.extractall("../../../data/telemetry")
health_data = pd.read_csv(f"../../../data/telemetry/{file_name}.csv")
health_data.head()
 | ts | device_uuid | invalid | report |
---|---|---|---|---|
0 | 2020-01-20 15:01:17 | 30c613da-3ee6-11ea-afb2-0025900057ea | t | {"dev": "/dev/sdb", "error": "smartctl failed"... |
1 | 2020-01-20 15:01:18 | 30b5dc0e-3ee6-11ea-afb2-0025900057ea | t | {"dev": "/dev/sdc", "error": "smartctl failed"... |
2 | 2020-01-20 15:01:19 | 30a59e48-3ee6-11ea-afb2-0025900057ea | t | {"dev": "/dev/sdd", "error": "smartctl failed"... |
3 | 2020-01-20 15:01:20 | 30b34b24-3ee6-11ea-afb2-0025900057ea | t | {"dev": "/dev/sde", "error": "smartctl failed"... |
4 | 2020-01-20 15:01:20 | 30bcc5c8-3ee6-11ea-afb2-0025900057ea | t | {"dev": "/dev/sdf", "error": "smartctl failed"... |
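Note that the `ts` column is read in as plain strings and the `invalid` flag as `"t"`/`"f"` strings. A minimal sketch (on a toy frame mimicking the schema above, not the real data) of converting these to more convenient dtypes before analysis:

```python
import pandas as pd

# toy frame mimicking the telemetry schema shown above (not the real data)
df = pd.DataFrame(
    {
        "ts": ["2020-01-20 15:01:17", "2020-01-20 15:01:18"],
        "invalid": ["t", "f"],
    }
)

# parse timestamps and map the 't'/'f' flags to booleans for easier filtering
df["ts"] = pd.to_datetime(df["ts"])
df["invalid"] = df["invalid"].map({"t": True, "f": False})
```

With boolean flags, filtering out invalid rows becomes `df[~df["invalid"]]` rather than a string comparison.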
Exploring dataset contents
We are first interested in understanding how many devices we are looking at, and how many data points are associated with each device.
# how many devices do we have data from?
health_data["device_uuid"].nunique()
89
# how many data points do we have from each device?
health_data["device_uuid"].value_counts()
30a81844-3ee6-11ea-afb2-0025900057ea 17
30c613da-3ee6-11ea-afb2-0025900057ea 17
30aef7c2-3ee6-11ea-afb2-0025900057ea 16
30c88304-3ee6-11ea-afb2-0025900057ea 16
30b5dc0e-3ee6-11ea-afb2-0025900057ea 16
..
0e8d2da2-4390-11ea-8497-0cc47a635394 4
5c058d0a-4459-11ea-a135-0cc47ad2c770 4
5c0dbfb6-4459-11ea-a135-0cc47ad2c770 4
5bfae076-4459-11ea-a135-0cc47ad2c770 4
a26045dd-40a4-11ea-aeb4-002590005994 1
Name: device_uuid, Length: 89, dtype: int64
# how many of these data points had valid data?
health_data.invalid.value_counts()
f 792
t 310
Name: invalid, dtype: int64
RESULT Based on the above outputs, it looks like we have data from 89 unique devices. The number of data points from each device ranges from 1 to 17. Furthermore, roughly 72% of these data points have valid data.
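The "roughly 72%" figure can be checked directly from the counts above:

```python
# counts of valid ('f') and invalid ('t') rows from the value_counts output above
valid, invalid = 792, 310

# 792 / 1102 ≈ 71.87%, i.e. roughly 72% of data points are valid
pct_valid = 100 * valid / (valid + invalid)
```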
# drop invalid data
health_data = health_data[health_data["invalid"] == "f"]
health_data.shape
(792, 4)
# convert json strings to python dicts
health_data["report"] = health_data["report"].apply(lambda x: json.loads(x))
# unroll device data column to get a flat df of features
unrolled_health_data = pd.json_normalize(health_data["report"])
unrolled_health_data.head()
 | vendor | host_id | product | revision | model_name | nvme_vendor | scsi_version | rotation_rate | logical_block_size | json_format_version | ... | nvme_smart_health_information_log.media_errors | nvme_smart_health_information_log.power_cycles | nvme_smart_health_information_log.power_on_hours | nvme_smart_health_information_log.data_units_read | nvme_smart_health_information_log.unsafe_shutdowns | nvme_smart_health_information_log.data_units_written | nvme_smart_health_information_log.num_err_log_entries | nvme_smart_health_information_log.controller_busy_time | nvme_total_capacity | nvme_unallocated_capacity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Hitachi | 6942c3b2-3c97-11ea-aeb4-002590005994 | HUA722010CLA330 | R001 | Hitachi HUA722010CLA330 | hitachi | SPC-3 | 10000.0 | 512 | [1, 0] | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Seagate | 30930e18-3ee6-11ea-afb2-0025900057ea | ST31000528AS | R001 | Seagate ST31000528AS | seagate | SPC-3 | 10000.0 | 512 | [1, 0] | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Hitachi | 30957ee6-3ee6-11ea-afb2-0025900057ea | HUA722010CLA330 | R001 | Hitachi HUA722010CLA330 | hitachi | SPC-3 | 10000.0 | 512 | [1, 0] | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Hitachi | 3099dcb6-3ee6-11ea-afb2-0025900057ea | HUA722010CLA330 | R001 | Hitachi HUA722010CLA330 | hitachi | SPC-3 | 10000.0 | 512 | [1, 0] | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Hitachi | 30957ee6-3ee6-11ea-afb2-0025900057ea | HUA722010CLA330 | R001 | Hitachi HUA722010CLA330 | hitachi | SPC-3 | 10000.0 | 512 | [1, 0] | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 196 columns
# how many disks had smartctl run successfully
unrolled_health_data["smartctl.exit_status"].value_counts()
0 493
4 299
Name: smartctl.exit_status, dtype: int64
sns.countplot(
x="smartctl.exit_status",
data=unrolled_health_data,
order=unrolled_health_data["smartctl.exit_status"].value_counts().index,
)
plt.show()
RESULT From the above cell, it looks like smartctl ran successfully with exit code 0 (no errors at all) for most data points. For some, smartctl exited with code 4 (i.e. bit 2 was set), which means some SMART attributes could not be fetched (as per the docs at https://linux.die.net/man/8/smartctl). In all cases, we have at least some valid SMART attributes from each device.
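Since smartctl's exit status is a bitmask, individual failure conditions can be recovered by testing each bit. A small helper sketch, with the bit meanings paraphrased from the smartctl man page linked above:

```python
def decode_smartctl_exit(status):
    """Return the meanings of the bits set in a smartctl exit status."""
    bits = {
        0: "command line did not parse",
        1: "device open failed",
        2: "a SMART or ATA command to the disk failed, or a checksum error",
        3: "SMART status check returned 'disk failing'",
        4: "prefail attributes found at or below threshold",
        5: "prefail attributes were at or below threshold at some point in the past",
        6: "device error log contains records of errors",
        7: "self-test log contains records of errors",
    }
    # test each bit of the status and collect the matching messages
    return [msg for bit, msg in bits.items() if status & (1 << bit)]
```

For the exit code 4 seen above, only bit 2 is set, which matches the "some attributes could not be fetched" interpretation.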
# extract smart metrics
smart_metrics_df = unrolled_health_data["ata_smart_attributes.table"].to_frame()

# iterate over rows by numerical index and flatten each drive's attribute table
for row_idx in tqdm(range(len(smart_metrics_df))):
    # get the smart stats for the current drive
    stats = smart_metrics_df.iloc[row_idx]["ata_smart_attributes.table"]
    if isinstance(stats, list):
        for stat in stats:
            # extract the normalized value, and the int form of the raw value
            smart_metrics_df.at[
                row_idx, "smart_" + str(stat["id"]) + "_normalized"
            ] = stat["value"]
            smart_metrics_df.at[row_idx, "smart_" + str(stat["id"]) + "_raw"] = stat[
                "raw"
            ]["value"]

smart_metrics_df.drop(columns=["ata_smart_attributes.table"], inplace=True)
smart_metrics_df.dropna(how="all").head()
100%|██████████| 792/792 [00:01<00:00, 534.69it/s]
 | smart_5_normalized | smart_5_raw | smart_9_normalized | smart_9_raw | smart_12_normalized | smart_12_raw | smart_177_normalized | smart_177_raw | smart_179_normalized | smart_179_raw | ... | smart_206_normalized | smart_206_raw | smart_210_normalized | smart_210_raw | smart_246_normalized | smart_246_raw | smart_247_normalized | smart_247_raw | smart_248_normalized | smart_248_raw |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
223 | 100.0 | 0.0 | 99.0 | 1136.0 | 99.0 | 2.0 | 99.0 | 1.0 | 100.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
226 | 100.0 | 0.0 | 99.0 | 1148.0 | 99.0 | 2.0 | 100.0 | 0.0 | 100.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
227 | 100.0 | 0.0 | 62.0 | 33997.0 | 100.0 | 50.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
228 | 100.0 | 0.0 | 100.0 | 2070.0 | 100.0 | 2.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
229 | 100.0 | 0.0 | 62.0 | 33997.0 | 100.0 | 51.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 104 columns
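Writing cells one at a time with `.at` can be slow on larger datasets. The same flattening can be sketched by building one dict per row and constructing the DataFrame in a single call, letting pandas fill missing attributes with NaN (shown here on toy rows mimicking the `ata_smart_attributes.table` structure, not the real data):

```python
import pandas as pd

# toy rows mimicking ata_smart_attributes.table entries (not the real data)
rows = [
    [
        {"id": 5, "value": 100, "raw": {"value": 0}},
        {"id": 9, "value": 99, "raw": {"value": 1136}},
    ],
    [{"id": 5, "value": 100, "raw": {"value": 0}}],
]

# build one flat dict per drive, then create the DataFrame in a single call
records = []
for stats in rows:
    rec = {}
    if isinstance(stats, list):
        for stat in stats:
            rec[f"smart_{stat['id']}_normalized"] = stat["value"]
            rec[f"smart_{stat['id']}_raw"] = stat["raw"]["value"]
    records.append(rec)

flat = pd.DataFrame(records)  # missing attributes become NaN automatically
```

The result has the same `smart_<id>_normalized` / `smart_<id>_raw` columns as the loop above, but avoids repeated single-cell writes into a growing frame.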
The telemetry dataset differs from the BackBlaze data in that it includes disks from multiple vendors, so we want an overview of the vendors present. A list of vendors is useful for deeper exploration, and knowing how common each vendor is lets us control for vendor popularity when looking at failure counts.
unrolled_health_data["nvme_vendor"].value_counts()
hitachi 209
lvm 177
crucial_ct1024m550ssd1 138
st2000nx0253 50
na 49
seagate 37
wdc_wds200t2b0a-00sm50 35
samsung_ssd_860_qvo_4tb 27
wdc_wds400t2b0a-00sm50 25
samsung_ssd_850_evo_1tb 17
intel 16
samsung_ssd_860_evo_4tb 8
centos-0_ssd 4
Name: nvme_vendor, dtype: int64
sns.countplot(
x="nvme_vendor",
data=unrolled_health_data,
order=unrolled_health_data["nvme_vendor"].value_counts().index,
)
plt.xticks(rotation=90)
plt.show()
Our initial overview of the vendors gives us a few insights. First, it becomes apparent that there is more data cleaning to be done, since some vendor names are a mix of vendor name and disk model name. Second, we see that "NA" is one of the most common vendor values. It may be necessary to look into other columns for clues about the vendor, or to find a way to handle NA values gracefully without compromising the integrity of our data.
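One hypothetical cleanup, assuming the mixed vendor/model entries always begin with the brand token, is to split on the underscore and keep only the leading piece, treating "na" as missing:

```python
import pandas as pd

# sample of the mixed vendor strings seen in the value_counts above
vendors = pd.Series(
    [
        "hitachi",
        "crucial_ct1024m550ssd1",
        "wdc_wds200t2b0a-00sm50",
        "samsung_ssd_860_qvo_4tb",
        "na",
    ]
)

# keep only the leading brand token, and mark "na" as missing
brand = vendors.str.split("_").str[0].replace("na", pd.NA)
```

This is only a sketch: real cleanup would need to verify the assumption against the full list of vendor strings and cross-check against the `model_name` column.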
We are also interested in seeing how often each disk model is used.
unrolled_health_data["model_name"].value_counts()
Hitachi HUA722010CLA330 185
WDC WDS200T2B0A-00SM50 155
Crucial_CT1024M550SSD1 138
ST2000NX0253 50
NA HUA721010KLA330 49
Samsung SSD 860 QVO 4TB 43
Seagate ST31000528AS 37
WDC WDS400T2B0A-00SM50 33
Samsung SSD 860 EVO 4TB 24
Samsung SSD 850 EVO 1TB 17
INTEL SSDPE2MX012T4 16
Hitachi HDS721010CLA330 12
Hitachi HUA721010KLA330 12
INTEL SSDPE2ME400G4 9
INTEL SSDPE2KX040T7 8
CentOS-0 SSD 4
Name: model_name, dtype: int64
sns.countplot(
x="model_name",
data=unrolled_health_data,
order=unrolled_health_data["model_name"].value_counts().index,
)
plt.xticks(rotation=90)
plt.show()
This helps us understand the overall distribution of the disks used, which are primarily Hitachi HUA722010CLA330, WDC WDS200T2B0A-00SM50, and Crucial_CT1024M550SSD1.
Conclusion
This dataset differs from the BackBlaze data in that we are no longer limited to a single data center. With a variety of users and their differing use cases, we get a more diverse dataset. Along with the increased diversity of vendors, we also have 16 different disk models whose performance we can analyze. Finally, this dataset has many more smartctl variables, allowing us to better understand the hard disk information that smartctl provides.