ODH Logo

Step 0: Know Thy Data Format

At Red Hat, the Backblaze Q4 2018 dataset files (.csv) were downloaded and converted to resemble the json output of the smartctl command. These json files have a fairly complex, deeply nested structure. The purpose of this notebook is to identify and describe this structure/schema.

NOTE: If you are not using the json-files version of the Backblaze dataset at Red Hat, this notebook may not be relevant for you.

The Q4 2018 dataset consists of 92 json files. Each file contains features such as SMART stats, model number, and capacity for ~96,000 hard drives. Assuming the format is consistent across all 92 files, only one such file is needed to decipher the dataset schema.

[1]
import pandas as pd
[12]
# reading the entire file is slow, and not very useful for this task
# so read in a small chunk of it
CHUNK_SIZE = 10000
FILE_PATH = "./smart_data_data_Q4_2018_2018-10-01.json"
df_reader = pd.read_json(FILE_PATH, lines=True, chunksize=CHUNK_SIZE)

# get the first chunk
chunk = next(df_reader)
chunk.head()
hints smartctl_json
0 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST4000DM000', 'serial_number':...
1 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST12000NM0007', 'serial_number...
2 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST12000NM0007', 'serial_number...
3 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'HGST HMS5C4040ALE640', 'serial...
4 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST8000NM0055', 'serial_number'...
[15]
# split the dataset into data and corresponding labels for inferring schema
data = chunk["smartctl_json"].apply(pd.Series)
labels = chunk["hints"].apply(pd.Series)
[18]
print("data shape =", data.shape)
data.head()
data shape = (10000, 5)
model_name serial_number model_family user_capacity ata_smart_attributes
0 ST4000DM000 Z305B2QN ST4000DM000 {'bytes': 4000787030016} {'table': [{'id': 1, 'value': 117, 'raw': {'va...
1 ST12000NM0007 ZJV0XJQ4 ST12000NM0007 {'bytes': 12000138625024} {'table': [{'id': 1, 'value': 68, 'raw': {'val...
2 ST12000NM0007 ZJV0XJQ0 ST12000NM0007 {'bytes': 12000138625024} {'table': [{'id': 1, 'value': 79, 'raw': {'val...
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H HGST HMS5C4040ALE640 {'bytes': 4000787030016} {'table': [{'id': 1, 'value': 100, 'raw': {'va...
4 ST8000NM0055 ZA16NQJR ST8000NM0055 {'bytes': 8001563222016} {'table': [{'id': 1, 'value': 80, 'raw': {'val...
[19]
print("labels shape =", labels.shape)
labels.head()
labels shape = (10000, 3)
is_backblaze backblaze_ts backblaze_failure_label
0 True 1.538352e+12 False
1 True 1.538352e+12 False
2 True 1.538352e+12 False
3 True 1.538352e+12 False
4 True 1.538352e+12 False
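The backblaze_ts values (~1.538352e+12) look like Unix timestamps in milliseconds, which would line up with the file's 2018-10-01 date. A quick sanity check of that assumption:

```python
import pandas as pd

# Assumption: backblaze_ts is a Unix epoch timestamp in milliseconds.
# 1538352000000 ms should decode to the start of Q4 2018.
ts = pd.to_datetime(1538352000000, unit="ms")
print(ts)  # 2018-10-01 00:00:00
```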
[26]
# ata_smart_attributes seems to be a dictionary with multiple values. Let's dive deeper into it.
# This section assumes the key/value structure does not change across rows.
data["ata_smart_attributes"][0].keys()
dict_keys(['table'])
[27]
# what does the value at "table" look like?
data["ata_smart_attributes"][0]["table"]
[{'id': 1, 'value': 117, 'raw': {'value': 148579464, 'string': 148579464}},
 {'id': 3, 'value': 91, 'raw': {'value': 0, 'string': 0}},
 {'id': 4, 'value': 100, 'raw': {'value': 12, 'string': 12}},
 {'id': 5, 'value': 100, 'raw': {'value': 0, 'string': 0}},
 {'id': 7, 'value': 82, 'raw': {'value': 167981075, 'string': 167981075}},
 {'id': 9, 'value': 73, 'raw': {'value': 24506, 'string': 24506}},
 {'id': 10, 'value': 100, 'raw': {'value': 0, 'string': 0}},
 {'id': 12, 'value': 100, 'raw': {'value': 12, 'string': 12}},
 {'id': 183, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 184, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 187, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 188, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 189, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 190, 'value': 78.0, 'raw': {'value': 22.0, 'string': 22.0}},
 {'id': 191, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 192, 'value': 100.0, 'raw': {'value': 0.0, 'string': 0.0}},
 {'id': 193, 'value': 83.0, 'raw': {'value': 34164.0, 'string': 34164.0}},
 {'id': 194, 'value': 22, 'raw': {'value': 22, 'string': 22}},
 {'id': 197, 'value': 100, 'raw': {'value': 0, 'string': 0}},
 {'id': 198, 'value': 100, 'raw': {'value': 0, 'string': 0}},
 {'id': 199, 'value': 200, 'raw': {'value': 0, 'string': 0}},
 {'id': 240, 'value': 100.0, 'raw': {'value': 24254.0, 'string': 24254.0}},
 {'id': 241,
  'value': 100.0,
  'raw': {'value': 43695026832.0, 'string': 43695026832.0}},
 {'id': 242,
  'value': 100.0,
  'raw': {'value': 102475323768.0, 'string': 102475323768.0}}]
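Given the list-of-dicts layout shown above, each row's table can be collapsed into a simple {smart_id: raw_value} mapping. A minimal sketch, using a hand-copied subset of the first row's table:

```python
# Two entries copied from the output above; the same comprehension works
# on the full data["ata_smart_attributes"][i]["table"] list.
table = [
    {"id": 1, "value": 117, "raw": {"value": 148579464, "string": 148579464}},
    {"id": 5, "value": 100, "raw": {"value": 0, "string": 0}},
]

# map each SMART stat id to its raw value
raw_values = {entry["id"]: entry["raw"]["value"] for entry in table}
print(raw_values)  # {1: 148579464, 5: 0}
```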

Results - Anatomy of a json file

  1. The dataset is made up of 92 such json files.
  2. Each json file has two columns (keys): hints and smartctl_json.
  3. In the hints column, each row is a json with keys backblaze_failure_label, is_backblaze, and backblaze_ts. That is, each of the ~96,000 hard drives has values corresponding to these three keys.
  4. In the smartctl_json column, each row is a json with keys ata_smart_attributes, model_name, serial_number, model_family, and user_capacity.
  5. ata_smart_attributes is itself a json with ONLY ONE KEY: table. The value at table is a list, and each member of the list is a json with keys id, value, and raw. id is the SMART stat number, value is its normalized value (note: it is unclear how this is normalized), and raw is a json with keys value and string (the raw value as an int, and again as a string).
  6. user_capacity is another json with ONLY ONE KEY: bytes.
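Putting the schema above together, one record can be flattened into a single flat row for tabular analysis. A minimal sketch, assuming the schema described above; the flatten helper and the smart_{id}_* column names are illustrative choices, not part of the dataset:

```python
import pandas as pd

# One record shaped per the schema above (values copied from row 0,
# trimmed to a single SMART attribute for brevity).
record = {
    "hints": {"is_backblaze": True, "backblaze_ts": 1538352000000,
              "backblaze_failure_label": False},
    "smartctl_json": {
        "model_name": "ST4000DM000",
        "serial_number": "Z305B2QN",
        "model_family": "ST4000DM000",
        "user_capacity": {"bytes": 4000787030016},
        "ata_smart_attributes": {
            "table": [{"id": 1, "value": 117,
                       "raw": {"value": 148579464, "string": 148579464}}]
        },
    },
}

def flatten(rec):
    """Flatten one nested record into a single flat dict."""
    s = rec["smartctl_json"]
    row = {**rec["hints"],
           "model_name": s["model_name"],
           "serial_number": s["serial_number"],
           "model_family": s["model_family"],
           "capacity_bytes": s["user_capacity"]["bytes"]}
    # one normalized + one raw column per SMART stat id
    for attr in s["ata_smart_attributes"]["table"]:
        row[f"smart_{attr['id']}_normalized"] = attr["value"]
        row[f"smart_{attr['id']}_raw"] = attr["raw"]["value"]
    return row

df = pd.DataFrame([flatten(record)])
print(df.columns.tolist())
```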
[ ]