
Step 0: JSON to pandas DataFrame

At Red Hat, the Backblaze Q4 2018 dataset files (.csv) were downloaded and converted to resemble the JSON output of the smartctl command. These JSON files have a fairly complex, deeply nested structure. The purpose of this notebook is to create a pandas DataFrame object from one such JSON file, because data is much easier to analyze and explore when it is in a more "human friendly" and simple form such as a flat DataFrame or CSV.

NOTE: If you are not using the JSON-files version of the Backblaze dataset at Red Hat, this notebook may not be relevant for you.

In this notebook, the data is first parsed. Then it is flattened, i.e. the deep nesting is removed. Next, any duplicate or redundant columns, as well as columns that will not be available at inference time, are removed. Essentially, the data is made to look exactly like a CSV file from the Backblaze dataset.
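To illustrate the flattening idea up front, `pd.json_normalize` turns a nested record into flat, dotted column names. The toy record below is shaped like the ones in this dataset (field names taken from the outputs further down); the notebook itself does the equivalent work step by step with `apply(pd.Series)` and manual unpacking:

```python
import pandas as pd

# toy record shaped like one line of the json files in this dataset
record = {
    "hints": {
        "is_backblaze": True,
        "backblaze_ts": 1538352000000,
        "backblaze_failure_label": False,
    },
    "smartctl_json": {
        "model_name": "ST4000DM000",
        "user_capacity": {"bytes": 4000787030016},
    },
}

# json_normalize flattens nested dicts into dotted column names
flat = pd.json_normalize(record)
print(flat.columns.tolist())
```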

[1]
import boto3
import pandas as pd
[2]
# Ceph data location
S3_ENDPOINT_URL = "https://s3.upshift.redhat.com/"
BUCKET_NAME = "DH-PLAYPEN"
PREFIX = "nwatkins/smart_data_Q4_2018"
[3]
# create an S3 client that will access Ceph
# NOTE: assumes AWS credentials are stored in ~/.aws/credentials; if not, they must be passed as parameters
s3 = boto3.client("s3", "us-east-1", endpoint_url=S3_ENDPOINT_URL)
[4]
# keys of the data files in the Ceph bucket
data_s3_keys = [
    content["Key"]
    for content in s3.list_objects(Bucket=BUCKET_NAME, Prefix=PREFIX)["Contents"]
]
[5]
# get the first file
ret = s3.get_object(Bucket=BUCKET_NAME, Key=data_s3_keys[0])
df = pd.read_json(ret["Body"].read().decode("utf-8"), lines=True)

print(df.shape)
df.head()
(99636, 2)
hints smartctl_json
0 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST4000DM000', 'serial_number':...
1 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST12000NM0007', 'serial_number...
2 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST12000NM0007', 'serial_number...
3 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'HGST HMS5C4040ALE640', 'serial...
4 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST8000NM0055', 'serial_number'...
[6]
# #################### USE THIS TO READ THE FILE LINE BY LINE #################### #

# import json

# # get the first json file
# ret = s3.get_object(Bucket=BUCKET_NAME, Key=data_s3_keys[0])

# MAX_NUM_LINES = 5
# data_rows = []
# label_rows = []

# for i, line in enumerate(ret["Body"].iter_lines(), start=1):
#     tmp = json.loads(line)
#     data_rows.append(tmp["smartctl_json"])
#     label_rows.append(tmp["hints"])
#     if i == MAX_NUM_LINES:
#         break

# # building the DataFrames once at the end is much faster than growing
# # them row by row (DataFrame.append was removed in pandas 2.0)
# data = pd.DataFrame(data_rows)
# labels = pd.DataFrame(label_rows)

# labels.head()
# data.head()
[7]
# get labels (the "Y")
labels = df["hints"].apply(pd.Series)
labels.head()
is_backblaze backblaze_ts backblaze_failure_label
0 True 1.538352e+12 False
1 True 1.538352e+12 False
2 True 1.538352e+12 False
3 True 1.538352e+12 False
4 True 1.538352e+12 False
[8]
# get data (the "X")
data = df["smartctl_json"].apply(pd.Series)
data.head()
model_name serial_number model_family user_capacity ata_smart_attributes
0 ST4000DM000 Z305B2QN ST4000DM000 {'bytes': 4000787030016} {'table': [{'id': 1, 'value': 117, 'raw': {'va...
1 ST12000NM0007 ZJV0XJQ4 ST12000NM0007 {'bytes': 12000138625024} {'table': [{'id': 1, 'value': 68, 'raw': {'val...
2 ST12000NM0007 ZJV0XJQ0 ST12000NM0007 {'bytes': 12000138625024} {'table': [{'id': 1, 'value': 79, 'raw': {'val...
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H HGST HMS5C4040ALE640 {'bytes': 4000787030016} {'table': [{'id': 1, 'value': 100, 'raw': {'va...
4 ST8000NM0055 ZA16NQJR ST8000NM0055 {'bytes': 8001563222016} {'table': [{'id': 1, 'value': 80, 'raw': {'val...
[9]
# change from dict type {'bytes': 123} to just int64 type 123
data["user_capacity"] = data["user_capacity"].apply(lambda x: x["bytes"])
data.head()
model_name serial_number model_family user_capacity ata_smart_attributes
0 ST4000DM000 Z305B2QN ST4000DM000 4000787030016 {'table': [{'id': 1, 'value': 117, 'raw': {'va...
1 ST12000NM0007 ZJV0XJQ4 ST12000NM0007 12000138625024 {'table': [{'id': 1, 'value': 68, 'raw': {'val...
2 ST12000NM0007 ZJV0XJQ0 ST12000NM0007 12000138625024 {'table': [{'id': 1, 'value': 79, 'raw': {'val...
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H HGST HMS5C4040ALE640 4000787030016 {'table': [{'id': 1, 'value': 100, 'raw': {'va...
4 ST8000NM0055 ZA16NQJR ST8000NM0055 8001563222016 {'table': [{'id': 1, 'value': 80, 'raw': {'val...
[10]
# change from dict type {'table': [{}, {}, {}]}  to list type [{}, {}, {}]
data["ata_smart_attributes"] = data["ata_smart_attributes"].apply(lambda x: x["table"])
data.head()
model_name serial_number model_family user_capacity ata_smart_attributes
0 ST4000DM000 Z305B2QN ST4000DM000 4000787030016 [{'id': 1, 'value': 117, 'raw': {'value': 1485...
1 ST12000NM0007 ZJV0XJQ4 ST12000NM0007 12000138625024 [{'id': 1, 'value': 68, 'raw': {'value': 65293...
2 ST12000NM0007 ZJV0XJQ0 ST12000NM0007 12000138625024 [{'id': 1, 'value': 79, 'raw': {'value': 75775...
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H HGST HMS5C4040ALE640 4000787030016 [{'id': 1, 'value': 100, 'raw': {'value': 0, '...
4 ST8000NM0055 ZA16NQJR ST8000NM0055 8001563222016 [{'id': 1, 'value': 80, 'raw': {'value': 10341...
[11]
# change from list of dict type [{}, {}, {}] to list of tuples type [(), (), ()]
data["ata_smart_attributes"] = data["ata_smart_attributes"].apply(
    lambda x: [tuple(kv.values()) for kv in x]
)
data.head()
model_name serial_number model_family user_capacity ata_smart_attributes
0 ST4000DM000 Z305B2QN ST4000DM000 4000787030016 [(1, 117, {'value': 148579464, 'string': 14857...
1 ST12000NM0007 ZJV0XJQ4 ST12000NM0007 12000138625024 [(1, 68, {'value': 6529384, 'string': 6529384}...
2 ST12000NM0007 ZJV0XJQ0 ST12000NM0007 12000138625024 [(1, 79, {'value': 75775832, 'string': 7577583...
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H HGST HMS5C4040ALE640 4000787030016 [(1, 100, {'value': 0, 'string': 0}), (2, 134....
4 ST8000NM0055 ZA16NQJR ST8000NM0055 8001563222016 [(1, 80, {'value': 103416784, 'string': 103416...
[12]
# split list of tuples [(), (), ()] into columns
# NOTE: each list contains tuples only for those attributes that are non-null for that drive
# e.g. hdd1's list may look like [(252, 10, {...}), (187, 10, {...})]
# but hdd2's list may look like [(12, 5, {...}), (5, 10, {...})]
for row_idx in range(data.shape[0]):

    # get the stats for current drive
    stats = data.loc[row_idx]["ata_smart_attributes"]

    for smart_id, norm_val, raw_val_dict in stats:
        # extract normalized value, and int form of raw value
        data.at[row_idx, "smart_" + str(smart_id) + "_normalized"] = norm_val
        data.at[row_idx, "smart_" + str(smart_id) + "_raw"] = raw_val_dict["value"]

# drop the original column as we have parsed its information and don't need it any more
data.drop("ata_smart_attributes", axis=1, inplace=True)
data.head()
model_name serial_number model_family user_capacity smart_1_normalized smart_1_raw smart_3_normalized smart_3_raw smart_4_normalized smart_4_raw ... smart_23_normalized smart_23_raw smart_24_normalized smart_24_raw smart_250_normalized smart_250_raw smart_251_normalized smart_251_raw smart_252_normalized smart_252_raw
0 ST4000DM000 Z305B2QN ST4000DM000 4000787030016 117.0 148579464.0 91.0 0.0 100.0 12.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 ST12000NM0007 ZJV0XJQ4 ST12000NM0007 12000138625024 68.0 6529384.0 98.0 0.0 100.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 ST12000NM0007 ZJV0XJQ0 ST12000NM0007 12000138625024 79.0 75775832.0 99.0 0.0 100.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H HGST HMS5C4040ALE640 4000787030016 100.0 0.0 100.0 436.0 100.0 8.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 ST8000NM0055 ZA16NQJR ST8000NM0055 8001563222016 80.0 103416784.0 92.0 0.0 100.0 6.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 90 columns
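The row-by-row loop above works, but each `data.at` write grows the frame one cell at a time. An equivalent and typically faster pattern is to build one flat dict per drive and let pandas assemble the columns in a single pass; a minimal sketch on toy data (column names and tuple layout match the cell above, the data values are made up):

```python
import pandas as pd

# toy 'ata_smart_attributes' column: list of (id, normalized, raw_dict) tuples per drive
attrs = pd.Series(
    [
        [(1, 117, {"value": 148579464}), (3, 91, {"value": 0})],
        [(5, 100, {"value": 0})],
    ]
)

def to_row(stats):
    # one flat dict per drive; attributes missing for a drive become NaN later
    row = {}
    for smart_id, norm_val, raw_val_dict in stats:
        row[f"smart_{smart_id}_normalized"] = norm_val
        row[f"smart_{smart_id}_raw"] = raw_val_dict["value"]
    return row

# DataFrame aligns the union of keys across rows and fills gaps with NaN
smart_df = pd.DataFrame(attrs.apply(to_row).tolist())
print(smart_df)
```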

[13]
# are all the timestamps the same within this json file?
(labels["backblaze_ts"] == 1538352000000.0).all()
True
[14]
# are these two columns different for any row?
(data["model_name"] != data["model_family"]).any()
False
[15]
# drop duplicate columns
data.drop(labels=["model_family"], axis=1, inplace=True)
data.head()
model_name serial_number user_capacity smart_1_normalized smart_1_raw smart_3_normalized smart_3_raw smart_4_normalized smart_4_raw smart_5_normalized ... smart_23_normalized smart_23_raw smart_24_normalized smart_24_raw smart_250_normalized smart_250_raw smart_251_normalized smart_251_raw smart_252_normalized smart_252_raw
0 ST4000DM000 Z305B2QN 4000787030016 117.0 148579464.0 91.0 0.0 100.0 12.0 100.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 ST12000NM0007 ZJV0XJQ4 12000138625024 68.0 6529384.0 98.0 0.0 100.0 2.0 100.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 ST12000NM0007 ZJV0XJQ0 12000138625024 79.0 75775832.0 99.0 0.0 100.0 1.0 100.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H 4000787030016 100.0 0.0 100.0 436.0 100.0 8.0 100.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 ST8000NM0055 ZA16NQJR 8001563222016 80.0 103416784.0 92.0 0.0 100.0 6.0 100.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 89 columns

NOTE: A datetime column can optionally be added to data, but first it must be ensured that the SMART stat fetching tool (e.g. smartctl) provides this information.
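For reference, the backblaze_ts values checked above are epoch milliseconds, so if a date column were ever added it could be derived like this (a standalone sketch on the timestamp seen in this file, not wired into data):

```python
import pandas as pd

# backblaze_ts is epoch time in milliseconds; 1538352000000 -> 2018-10-01
ts = pd.Series([1538352000000.0])
dates = pd.to_datetime(ts, unit="ms").dt.strftime("%Y-%m-%d")
print(dates[0])  # 2018-10-01
```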

[16]
bbdf = pd.read_csv("/home/kachauha/Downloads/data_Q4_2018/2018-10-01.csv")
bbdf.head()
date serial_number model capacity_bytes failure smart_1_normalized smart_1_raw smart_2_normalized smart_2_raw smart_3_normalized ... smart_250_normalized smart_250_raw smart_251_normalized smart_251_raw smart_252_normalized smart_252_raw smart_254_normalized smart_254_raw smart_255_normalized smart_255_raw
0 2018-10-01 Z305B2QN ST4000DM000 4000787030016 0 117 148579464 NaN NaN 91 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2018-10-01 ZJV0XJQ4 ST12000NM0007 12000138625024 0 68 6529384 NaN NaN 98 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2018-10-01 ZJV0XJQ0 ST12000NM0007 12000138625024 0 79 75775832 NaN NaN 99 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2018-10-01 PL1331LAHG1S4H HGST HMS5C4040ALE640 4000787030016 0 100 0 134.0 103.0 100 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2018-10-01 ZA16NQJR ST8000NM0055 8001563222016 0 80 103416784 NaN NaN 92 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 129 columns
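A quick way to check how close the flattened frame is to the Backblaze CSV layout is to diff the two column sets. A self-contained sketch of the idea, with small toy sets standing in for `data.columns` and `bbdf.columns` (the real frames have 89 and 129 columns respectively):

```python
# toy stand-ins for set(data.columns) and set(bbdf.columns)
flattened_cols = {"model_name", "serial_number", "user_capacity", "smart_1_raw"}
csv_cols = {"date", "model", "serial_number", "capacity_bytes", "failure", "smart_1_raw"}

# columns present in one frame but not the other
print(sorted(flattened_cols - csv_cols))  # ['model_name', 'user_capacity']
print(sorted(csv_cols - flattened_cols))  # ['capacity_bytes', 'date', 'failure', 'model']
```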

[ ]