ODH Logo

Step 0: Know Thy Data Format

At Red Hat, the Backblaze Q4 2018 dataset files (.csv) were downloaded and converted to resemble the json output of the smartctl command. These json files have a fairly complex, deeply nested structure. The purpose of this notebook is to identify and describe this structure/schema.

NOTE: If you are not using the json-files version of the Backblaze dataset at Red Hat, this notebook may not be relevant for you.

The Q4 2018 dataset consists of 92 json files. Each file contains features such as SMART stats, model number, and capacity for ~96,000 hard drives. Assuming the format is consistent across all 92 files, only one such file is needed to decipher the dataset schema.

[1]
import pandas as pd
[12]
# reading the entire file is slow, and not very useful for this task
# so read in a small chunk of it
CHUNK_SIZE = 10000
FILE_PATH = "./smart_data_data_Q4_2018_2018-10-01.json"
df_reader = pd.read_json(FILE_PATH, lines=True, chunksize=CHUNK_SIZE)

# get the first chunk
chunk = next(df_reader)
chunk.head()
hints smartctl_json
0 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST4000DM000', 'serial_number':...
1 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST12000NM0007', 'serial_number...
2 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST12000NM0007', 'serial_number...
3 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'HGST HMS5C4040ALE640', 'serial...
4 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST8000NM0055', 'serial_number'...
[15]
# split the dataset into data and corresponding labels for inferring schema
data = chunk["smartctl_json"].apply(pd.Series)
labels = chunk["hints"].apply(pd.Series)
[18]
print("data shape =", data.shape)
data.head()
data shape = (10000, 5)
model_name serial_number model_family user_capacity ata_smart_attributes
0 ST4000DM000 Z305B2QN ST4000DM000 {'bytes': 4000787030016} {'table': [{'id': 1, 'value': 117, 'raw': {'va...
1 ST12000NM0007 ZJV0XJQ4 ST12000NM0007 {'bytes': 12000138625024} {'table': [{'id': 1, 'value': 68, 'raw': {'val...
2 ST12000NM0007 ZJV0XJQ0 ST12000NM0007 {'bytes': 12000138625024} {'table': [{'id': 1, 'value': 79, 'raw': {'val...
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H HGST HMS5C4040ALE640 {'bytes': 4000787030016} {'table': [{'id': 1, 'value': 100, 'raw': {'va...
4 ST8000NM0055 ZA16NQJR ST8000NM0055 {'bytes': 8001563222016} {'table': [{'id': 1, 'value': 80, 'raw': {'val...
[19]
print("labels shape =", labels.shape)
labels.head()
labels shape = (10000, 3)
is_backblaze backblaze_ts backblaze_failure_label
0 True 1.538352e+12 False
1 True 1.538352e+12 False
2 True 1.538352e+12 False
3 True 1.538352e+12 False
4 True 1.538352e+12 False
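The backblaze_ts values (~1.538352e+12) look like Unix timestamps in milliseconds, which would line up with the file's 2018-10-01 date. A quick sanity check of that assumption:

```python
import pandas as pd

# Assumption: backblaze_ts is a Unix epoch timestamp in milliseconds.
# 1538352000000 ms should decode to the start of Q4 2018.
ts = pd.to_datetime(1538352000000, unit="ms")
print(ts)  # 2018-10-01 00:00:00
```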
[26]
# ata_smart_attributes seems to be a dictionary with multiple values. Let's dive deeper into it.
# This section assumes the key/value structure does not change across rows.
data["ata_smart_attributes"][0].keys()
dict_keys(['table'])
[27]
# what does the value at "table" look like?
data["ata_smart_attributes"][0]["table"]
[{'id': 1, 'value': 117, 'raw': {'value': 148579464, 'string': 148579464}},
 {'id': 3, 'value': 91, 'raw': {'value': 0, 'string': 0}},
 {'id': 4, 'value': 100, 'raw': {'value': 12, 'string': 12}},
 {'id': 5, 'value': 100, 'raw': {'value': 0, 'string': 0}},
 {'id': 7, 'value': 82, 'raw': {'value': 167981075, 'string': 167981075}},
 {'id': 9, 'value': 73, 'raw': {'value': 24506, 'string': 24506}},
 {'id': 10, 'value': 100, 'raw': {'value': 0, 'string': 0}},
 {'id': 12, 'value': 100, 'raw': {'value': 12, 'string': 12}},
 {'id': 183, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 184, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 187, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 188, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 189, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 190, 'value': 78.0, 'raw': {'value': 22.0, 'string': 22.0}},
 {'id': 191, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 192, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 193, 'value': 83.0, 'raw': {'value': 34164.0, 'string': 34164.0}},
 {'id': 194, 'value': 22, 'raw': {'value': 22, 'string': 22}},
 {'id': 197, 'value': 100, 'raw': {'value': 0, 'string': 0}},
 {'id': 198, 'value': 100, 'raw': {'value': 0, 'string': 0}},
 {'id': 199, 'value': 200, 'raw': {'value': 0, 'string': 0}},
 {'id': 240, 'value': 100.0, 'raw': {'value': 24254.0, 'string': 24254.0}},
 {'id': 241,
  'value': 100.0,
  'raw': {'value': 43695026832.0, 'string': 43695026832.0}},
 {'id': 242,
  'value': 100.0,
  'raw': {'value': 102475323768.0, 'string': 102475323768.0}}]
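Given the list-of-dicts layout shown above, each row's table can be collapsed into a simple {smart_id: raw_value} mapping. A minimal sketch, using a hand-copied subset of the first row's table:

```python
# Two entries copied from the output above; the same comprehension works
# on the full data["ata_smart_attributes"][i]["table"] list.
table = [
    {"id": 1, "value": 117, "raw": {"value": 148579464, "string": 148579464}},
    {"id": 5, "value": 100, "raw": {"value": 0, "string": 0}},
]

# map each SMART stat id to its raw value
raw_values = {entry["id"]: entry["raw"]["value"] for entry in table}
print(raw_values)  # {1: 148579464, 5: 0}
```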

Results - Anatomy of a json file

  1. The dataset is made up of 92 such json files.
  2. Each json file has two columns (keys): hints and smartctl_json.
  3. In the hints column, each row is a json with keys backblaze_failure_label, is_backblaze, and backblaze_ts. That is, each of the ~96,000 hard drives has values corresponding to these three keys.
  4. In the smartctl_json column, each row is a json with keys ata_smart_attributes, model_name, serial_number, model_family, and user_capacity.
  5. ata_smart_attributes is itself a json with ONLY ONE KEY: table. The value at table is a list, and each member of the list is a json with keys id, value, and raw. id is the SMART stat number, value is its normalized value (note: it is unclear how this is normalized), and raw is a json with keys value and string (the raw value as an int, and again as a string).
  6. user_capacity is another json with ONLY ONE KEY: bytes.
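Putting the schema above together, one record can be flattened into a single flat row for tabular analysis. A minimal sketch, assuming the schema described above; the flatten helper and the smart_{id}_* column names are illustrative choices, not part of the dataset:

```python
import pandas as pd

# One record shaped per the schema above (values copied from row 0,
# trimmed to a single SMART attribute for brevity).
record = {
    "hints": {"is_backblaze": True, "backblaze_ts": 1538352000000,
              "backblaze_failure_label": False},
    "smartctl_json": {
        "model_name": "ST4000DM000",
        "serial_number": "Z305B2QN",
        "model_family": "ST4000DM000",
        "user_capacity": {"bytes": 4000787030016},
        "ata_smart_attributes": {
            "table": [{"id": 1, "value": 117,
                       "raw": {"value": 148579464, "string": 148579464}}]
        },
    },
}

def flatten(rec):
    """Flatten one nested record into a single flat dict."""
    s = rec["smartctl_json"]
    row = {**rec["hints"],
           "model_name": s["model_name"],
           "serial_number": s["serial_number"],
           "model_family": s["model_family"],
           "capacity_bytes": s["user_capacity"]["bytes"]}
    # one normalized + one raw column per SMART stat id
    for attr in s["ata_smart_attributes"]["table"]:
        row[f"smart_{attr['id']}_normalized"] = attr["value"]
        row[f"smart_{attr['id']}_raw"] = attr["raw"]["value"]
    return row

df = pd.DataFrame([flatten(record)])
print(df.columns.tolist())
```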
[ ]