
Step 0: JSON to pandas DataFrame

At Red Hat, the Backblaze Q4 2018 dataset files (.csv) were downloaded and converted to resemble the JSON output of the smartctl command. These JSON files have a fairly complex, deeply nested structure. The purpose of this notebook is to create a pandas DataFrame object from one such JSON file, because data is much easier to analyze and explore when it is in a more "human friendly" and simple form such as a flat DataFrame or CSV.

NOTE: If you are not using the JSON-files version of the Backblaze dataset at Red Hat, this notebook may not be relevant for you.

In this notebook, the data is first parsed. Then it is flattened, i.e. the deep nesting is removed. Next, any duplicate or redundant columns, as well as columns that will not be available at inference time, are removed. Essentially, the data is made to look exactly like a CSV file from the Backblaze dataset.
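To illustrate the flattening idea up front, `pd.json_normalize` turns a nested record into flat, dotted column names. The toy record below is shaped like the ones in this dataset (field names taken from the outputs further down); the notebook itself does the equivalent work step by step with `apply(pd.Series)` and manual unpacking:

```python
import pandas as pd

# toy record shaped like one line of the json files in this dataset
record = {
    "hints": {
        "is_backblaze": True,
        "backblaze_ts": 1538352000000,
        "backblaze_failure_label": False,
    },
    "smartctl_json": {
        "model_name": "ST4000DM000",
        "user_capacity": {"bytes": 4000787030016},
    },
}

# json_normalize flattens nested dicts into dotted column names
flat = pd.json_normalize(record)
print(flat.columns.tolist())
```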

[1]
import boto3
import pandas as pd
[2]
# Ceph data location
S3_ENDPOINT_URL = "https://s3.upshift.redhat.com/"
BUCKET_NAME = "DH-PLAYPEN"
PREFIX = "nwatkins/smart_data_Q4_2018"
[3]
# create an S3 client that will access Ceph
# NOTE: assumes AWS credentials are stored in ~/.aws/credentials; if not, they must be passed as parameters
s3 = boto3.client("s3", "us-east-1", endpoint_url=S3_ENDPOINT_URL)
[4]
# keys of the data files in the Ceph bucket
data_s3_keys = [
    content["Key"]
    for content in s3.list_objects(Bucket=BUCKET_NAME, Prefix=PREFIX)["Contents"]
]
[5]
# get the first file
ret = s3.get_object(Bucket=BUCKET_NAME, Key=data_s3_keys[0])
df = pd.read_json(ret["Body"].read().decode("utf-8"), lines=True)

print(df.shape)
df.head()
(99636, 2)
hints smartctl_json
0 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST4000DM000', 'serial_number':...
1 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST12000NM0007', 'serial_number...
2 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST12000NM0007', 'serial_number...
3 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'HGST HMS5C4040ALE640', 'serial...
4 {'is_backblaze': True, 'backblaze_ts': 1538352... {'model_name': 'ST8000NM0055', 'serial_number'...
[6]
# #################### USE THIS TO READ THE FILE LINE BY LINE #################### #

# import json

# # get the first json file
# ret = s3.get_object(Bucket=BUCKET_NAME, Key=data_s3_keys[0])

# MAX_NUM_LINES = 5
# data_rows = []
# label_rows = []

# for i, line in enumerate(ret["Body"].iter_lines(), start=1):
#     tmp = json.loads(line)
#     data_rows.append(tmp["smartctl_json"])
#     label_rows.append(tmp["hints"])
#     if i == MAX_NUM_LINES:
#         break

# # building the DataFrames once at the end is much faster than growing
# # them row by row (DataFrame.append was removed in pandas 2.0)
# data = pd.DataFrame(data_rows)
# labels = pd.DataFrame(label_rows)

# labels.head()
# data.head()
[7]
# get labels (the "Y")
labels = df["hints"].apply(pd.Series)
labels.head()
is_backblaze backblaze_ts backblaze_failure_label
0 True 1.538352e+12 False
1 True 1.538352e+12 False
2 True 1.538352e+12 False
3 True 1.538352e+12 False
4 True 1.538352e+12 False
[8]
# get data (the "X")
data = df["smartctl_json"].apply(pd.Series)
data.head()
model_name serial_number model_family user_capacity ata_smart_attributes
0 ST4000DM000 Z305B2QN ST4000DM000 {'bytes': 4000787030016} {'table': [{'id': 1, 'value': 117, 'raw': {'va...
1 ST12000NM0007 ZJV0XJQ4 ST12000NM0007 {'bytes': 12000138625024} {'table': [{'id': 1, 'value': 68, 'raw': {'val...
2 ST12000NM0007 ZJV0XJQ0 ST12000NM0007 {'bytes': 12000138625024} {'table': [{'id': 1, 'value': 79, 'raw': {'val...
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H HGST HMS5C4040ALE640 {'bytes': 4000787030016} {'table': [{'id': 1, 'value': 100, 'raw': {'va...
4 ST8000NM0055 ZA16NQJR ST8000NM0055 {'bytes': 8001563222016} {'table': [{'id': 1, 'value': 80, 'raw': {'val...
[9]
# change from dict type {'bytes': 123} to just int64 type 123
data["user_capacity"] = data["user_capacity"].apply(lambda x: x["bytes"])
data.head()
model_name serial_number model_family user_capacity ata_smart_attributes
0 ST4000DM000 Z305B2QN ST4000DM000 4000787030016 {'table': [{'id': 1, 'value': 117, 'raw': {'va...
1 ST12000NM0007 ZJV0XJQ4 ST12000NM0007 12000138625024 {'table': [{'id': 1, 'value': 68, 'raw': {'val...
2 ST12000NM0007 ZJV0XJQ0 ST12000NM0007 12000138625024 {'table': [{'id': 1, 'value': 79, 'raw': {'val...
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H HGST HMS5C4040ALE640 4000787030016 {'table': [{'id': 1, 'value': 100, 'raw': {'va...
4 ST8000NM0055 ZA16NQJR ST8000NM0055 8001563222016 {'table': [{'id': 1, 'value': 80, 'raw': {'val...
[10]
# change from dict type {'table': [{}, {}, {}]}  to list type [{}, {}, {}]
data["ata_smart_attributes"] = data["ata_smart_attributes"].apply(lambda x: x["table"])
data.head()
model_name serial_number model_family user_capacity ata_smart_attributes
0 ST4000DM000 Z305B2QN ST4000DM000 4000787030016 [{'id': 1, 'value': 117, 'raw': {'value': 1485...
1 ST12000NM0007 ZJV0XJQ4 ST12000NM0007 12000138625024 [{'id': 1, 'value': 68, 'raw': {'value': 65293...
2 ST12000NM0007 ZJV0XJQ0 ST12000NM0007 12000138625024 [{'id': 1, 'value': 79, 'raw': {'value': 75775...
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H HGST HMS5C4040ALE640 4000787030016 [{'id': 1, 'value': 100, 'raw': {'value': 0, '...
4 ST8000NM0055 ZA16NQJR ST8000NM0055 8001563222016 [{'id': 1, 'value': 80, 'raw': {'value': 10341...
[11]
# change from list of dict type [{}, {}, {}] to list of tuples type [(), (), ()]
data["ata_smart_attributes"] = data["ata_smart_attributes"].apply(
    lambda x: [tuple(kv.values()) for kv in x]
)
data.head()
model_name serial_number model_family user_capacity ata_smart_attributes
0 ST4000DM000 Z305B2QN ST4000DM000 4000787030016 [(1, 117, {'value': 148579464, 'string': 14857...
1 ST12000NM0007 ZJV0XJQ4 ST12000NM0007 12000138625024 [(1, 68, {'value': 6529384, 'string': 6529384}...
2 ST12000NM0007 ZJV0XJQ0 ST12000NM0007 12000138625024 [(1, 79, {'value': 75775832, 'string': 7577583...
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H HGST HMS5C4040ALE640 4000787030016 [(1, 100, {'value': 0, 'string': 0}), (2, 134....
4 ST8000NM0055 ZA16NQJR ST8000NM0055 8001563222016 [(1, 80, {'value': 103416784, 'string': 103416...
[12]
# split list of tuples [(), (), ()] into columns
# NOTE: each list contains tuples only for those attributes that are non-null for that drive
# e.g. hdd1's list may look like [(252, 10, {...}), (187, 10, {...})]
# but hdd2's list may look like [(12, 5, {...}), (5, 10, {...})]
for row_idx in range(data.shape[0]):

    # get the stats for current drive
    stats = data.loc[row_idx]["ata_smart_attributes"]

    for smart_id, norm_val, raw_val_dict in stats:
        # extract normalized value, and int form of raw value
        data.at[row_idx, "smart_" + str(smart_id) + "_normalized"] = norm_val
        data.at[row_idx, "smart_" + str(smart_id) + "_raw"] = raw_val_dict["value"]

# drop the original column as we have parsed its information and don't need it any more
data.drop("ata_smart_attributes", axis=1, inplace=True)
data.head()
model_name serial_number model_family user_capacity smart_1_normalized smart_1_raw smart_3_normalized smart_3_raw smart_4_normalized smart_4_raw ... smart_23_normalized smart_23_raw smart_24_normalized smart_24_raw smart_250_normalized smart_250_raw smart_251_normalized smart_251_raw smart_252_normalized smart_252_raw
0 ST4000DM000 Z305B2QN ST4000DM000 4000787030016 117.0 148579464.0 91.0 0.0 100.0 12.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 ST12000NM0007 ZJV0XJQ4 ST12000NM0007 12000138625024 68.0 6529384.0 98.0 0.0 100.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 ST12000NM0007 ZJV0XJQ0 ST12000NM0007 12000138625024 79.0 75775832.0 99.0 0.0 100.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H HGST HMS5C4040ALE640 4000787030016 100.0 0.0 100.0 436.0 100.0 8.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 ST8000NM0055 ZA16NQJR ST8000NM0055 8001563222016 80.0 103416784.0 92.0 0.0 100.0 6.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 90 columns
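The row-by-row loop above works, but each `data.at` write grows the frame one cell at a time. An equivalent and typically faster pattern is to build one flat dict per drive and let pandas assemble the columns in a single pass; a minimal sketch on toy data (column names and tuple layout match the cell above, the data values are made up):

```python
import pandas as pd

# toy 'ata_smart_attributes' column: list of (id, normalized, raw_dict) tuples per drive
attrs = pd.Series(
    [
        [(1, 117, {"value": 148579464}), (3, 91, {"value": 0})],
        [(5, 100, {"value": 0})],
    ]
)

def to_row(stats):
    # one flat dict per drive; attributes missing for a drive become NaN later
    row = {}
    for smart_id, norm_val, raw_val_dict in stats:
        row[f"smart_{smart_id}_normalized"] = norm_val
        row[f"smart_{smart_id}_raw"] = raw_val_dict["value"]
    return row

# DataFrame aligns the union of keys across rows and fills gaps with NaN
smart_df = pd.DataFrame(attrs.apply(to_row).tolist())
print(smart_df)
```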

[13]
# are all the timestamps the same within this json file?
(labels["backblaze_ts"] == 1538352000000.0).all()
True
[14]
# are these two columns different for any row?
(data["model_name"] != data["model_family"]).any()
False
[15]
# drop duplicate columns
data.drop(labels=["model_family"], axis=1, inplace=True)
data.head()
model_name serial_number user_capacity smart_1_normalized smart_1_raw smart_3_normalized smart_3_raw smart_4_normalized smart_4_raw smart_5_normalized ... smart_23_normalized smart_23_raw smart_24_normalized smart_24_raw smart_250_normalized smart_250_raw smart_251_normalized smart_251_raw smart_252_normalized smart_252_raw
0 ST4000DM000 Z305B2QN 4000787030016 117.0 148579464.0 91.0 0.0 100.0 12.0 100.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 ST12000NM0007 ZJV0XJQ4 12000138625024 68.0 6529384.0 98.0 0.0 100.0 2.0 100.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 ST12000NM0007 ZJV0XJQ0 12000138625024 79.0 75775832.0 99.0 0.0 100.0 1.0 100.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 HGST HMS5C4040ALE640 PL1331LAHG1S4H 4000787030016 100.0 0.0 100.0 436.0 100.0 8.0 100.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 ST8000NM0055 ZA16NQJR 8001563222016 80.0 103416784.0 92.0 0.0 100.0 6.0 100.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 89 columns

NOTE: A datetime column can optionally be added to data, but first it must be ensured that the SMART stat fetching tool (e.g. smartctl) provides this information.
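For reference, the backblaze_ts values checked above are epoch milliseconds, so if a date column were ever added it could be derived like this (a standalone sketch on the timestamp seen in this file, not wired into data):

```python
import pandas as pd

# backblaze_ts is epoch time in milliseconds; 1538352000000 -> 2018-10-01
ts = pd.Series([1538352000000.0])
dates = pd.to_datetime(ts, unit="ms").dt.strftime("%Y-%m-%d")
print(dates[0])  # 2018-10-01
```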

[16]
bbdf = pd.read_csv("/home/kachauha/Downloads/data_Q4_2018/2018-10-01.csv")
bbdf.head()
date serial_number model capacity_bytes failure smart_1_normalized smart_1_raw smart_2_normalized smart_2_raw smart_3_normalized ... smart_250_normalized smart_250_raw smart_251_normalized smart_251_raw smart_252_normalized smart_252_raw smart_254_normalized smart_254_raw smart_255_normalized smart_255_raw
0 2018-10-01 Z305B2QN ST4000DM000 4000787030016 0 117 148579464 NaN NaN 91 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2018-10-01 ZJV0XJQ4 ST12000NM0007 12000138625024 0 68 6529384 NaN NaN 98 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2018-10-01 ZJV0XJQ0 ST12000NM0007 12000138625024 0 79 75775832 NaN NaN 99 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2018-10-01 PL1331LAHG1S4H HGST HMS5C4040ALE640 4000787030016 0 100 0 134.0 103.0 100 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2018-10-01 ZA16NQJR ST8000NM0055 8001563222016 0 80 103416784 NaN NaN 92 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 129 columns
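A quick way to check how close the flattened frame is to the Backblaze CSV layout is to diff the two column sets. A self-contained sketch of the idea, with small toy sets standing in for `data.columns` and `bbdf.columns` (the real frames have 89 and 129 columns respectively):

```python
# toy stand-ins for set(data.columns) and set(bbdf.columns)
flattened_cols = {"model_name", "serial_number", "user_capacity", "smart_1_raw"}
csv_cols = {"date", "model", "serial_number", "capacity_bytes", "failure", "smart_1_raw"}

# columns present in one frame but not the other
print(sorted(flattened_cols - csv_cols))  # ['model_name', 'user_capacity']
print(sorted(csv_cols - flattened_cols))  # ['capacity_bytes', 'date', 'failure', 'model']
```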

[ ]